IBM NS1 Connect

Orb: The foundation of network observability at NS1

By Nathanael Jean-Francois posted Tue October 03, 2023 11:32 AM


I recently provided a peek behind the curtain of NS1’s internal operations to show how we leverage NetBox for network automation. In this post, I’ll show how Orb - an open-source network observability tool we developed internally - keeps the NS1 Managed DNS network at optimal performance 24/7/365.

Observability on the NS1 Network

The Orb project began with a simple question: how can NS1 get the visibility it needs to protect, operate, debug, and improve performance and reliability on its global network of 26 Points of Presence (PoPs)?

Looking at available solutions, the NS1 team found that many of them didn’t collect the sort of deep network data that would be useful for a DNS platform like ours. Even if they did, the cost of collecting, storing, and processing that data would be prohibitive, given the scale of NS1’s operations. So the decision was made to build a tool ourselves.

The initial result was pktvisor - an open-source packet analyzer designed for high-volume, information-dense network data streams, which can extract actionable insights directly at the edge. Over time, the project expanded to include agent orchestration, a control plane, and support for new data streams and analyzers. All of these are now available in Orb, a dynamic edge observability platform.

Today, NS1 uses Orb and pktvisor to proactively mitigate any threat to network performance - whether it’s a DDoS attack, a misconfiguration, a resource misalignment, or any of a range of other potential problems.

DDoS Attacks and Root Cause Analysis

In our line of work, DDoS attacks are part of doing business. Orb is a critical piece of NS1’s highly resilient network architecture, designed to minimize the impact of DDoS events in line with our 100% uptime SLA.

When we see a spike in traffic, Orb helps us quickly identify the cause and points us toward potential solutions. More often than not, the cause is a DDoS attack. But Orb is just as important in showing us when something other than DDoS activity is to blame.

In a recent incident, we noticed a spike in activity that started to degrade performance in certain geographies. While our initial instinct was to brand it as a DDoS attack, a quick look at the data in Orb dispelled that notion.

Looking into DNS response traffic, Orb showed that the spike in activity was only hitting a single domain. DDoS attacks usually hit multiple subdomains (real and non-existent) to force the exhaustion of resources connected to servicing that target. The lack of this kind of behavior was the first indication that the source of the issue was not a standard DDoS attack.
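To make that first check concrete, here is a minimal Python sketch of the idea. It is not Orb's implementation or API; it assumes you already have a sample of query names captured during the spike, and the function name `qname_concentration` and the example.com data are hypothetical.

```python
from collections import Counter


def qname_concentration(qnames):
    """Return (distinct_names, share_of_most_queried_name) for a sample of query names."""
    # Normalize case and the trailing root dot so "WWW.Example.com." and
    # "www.example.com" are counted as the same name.
    counts = Counter(name.lower().rstrip(".") for name in qnames)
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0
    top_count = counts.most_common(1)[0][1]
    return len(counts), top_count / total


# Hypothetical samples: a spike focused on one name versus a flood of
# unique (mostly non-existent) subdomains.
single_name_spike = ["www.example.com."] * 95 + ["mail.example.com."] * 5
random_subdomain_flood = [f"x{i}.example.com." for i in range(100)]

print(qname_concentration(single_name_spike))
print(qname_concentration(random_subdomain_flood))
```

A handful of distinct names with one name dominating is consistent with a spike aimed at a single domain, while many distinct names that each appear once looks more like a random-subdomain flood. Any cut-off between the two would need tuning against real traffic.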

Then we used Orb to look at the response codes returned on queries. We saw a huge decline in OK responses (which in DNS appear as NOERROR) and a parallel rise in SERVFAIL responses. Because SERVFAIL signals that a nameserver was unable to process a query, this gave us the intelligence we needed to rule out a DDoS attack as the root cause. Instead, the Orb data pointed us in a different direction, eventually leading to the real root cause.
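The response-code comparison can be sketched the same way. This is an illustrative example under stated assumptions, not how Orb computes its metrics: it takes two hypothetical lists of response code names (a baseline window and an incident window) and reports how each code's share of responses changed. The helper names `rcode_shares` and `share_shift` are made up for this sketch.

```python
from collections import Counter


def rcode_shares(rcodes):
    """Map each response code to its fraction of all responses in the window."""
    counts = Counter(rcodes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {rcode: n / total for rcode, n in counts.items()}


def share_shift(baseline, current):
    """Return the change in share (current minus baseline) for every code seen in either window."""
    before = rcode_shares(baseline)
    after = rcode_shares(current)
    return {
        rcode: after.get(rcode, 0.0) - before.get(rcode, 0.0)
        for rcode in set(before) | set(after)
    }


# Hypothetical windows: mostly NOERROR before, mostly SERVFAIL during the incident.
baseline_window = ["NOERROR"] * 98 + ["SERVFAIL"] * 2
incident_window = ["NOERROR"] * 40 + ["SERVFAIL"] * 60

shift = share_shift(baseline_window, incident_window)
for rcode in sorted(shift):
    print(f"{rcode}: {shift[rcode]:+.2f}")
```

A large negative shift for NOERROR paired with a matching positive shift for SERVFAIL is the pattern described above.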

Having the Orb data at our fingertips allowed NS1 to identify and correct the error in a matter of minutes, minimizing the impact to our customers and providing clarity on how to prevent similar incidents in the future.

Network Performance Monitoring

Orb is part of my daily routine. Every morning, I look at our dashboard and glance through the traffic overnight and over the last several days. Most of the time, I see smooth, flattened curves that indicate broader patterns of “internet weather”. When a spike in activity catches my eye, Orb gives me the data I need to investigate the issue further.

The dashboard I find most useful breaks down NS1’s traffic by geography. When I see spikes in activity coming from a particular region, that usually indicates DDoS activity or some kind of local misconfiguration that impacts network performance for a particular customer.

Orb data shows us low-level DDoS activity that our customers may not notice. Since NS1 is set up to perform well even during traffic surges, smaller attacks often stay under the radar because there’s no tangible impact on performance. Orb picks up these smaller indicators, giving both NS1 and its customers the data necessary to take defensive steps.

I was going through this geography dashboard recently and found an elevated pattern of NXDOMAIN responses in a single PoP. Digging deeper into the data Orb collected, I found that a single CNAME record belonging to one of our customers wasn’t configured correctly. It pointed to the wrong place, resulting in invalid queries to the intended authoritative nameserver.
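As a rough illustration of spotting that kind of outlier, the sketch below compares each PoP's NXDOMAIN rate against the median rate across all PoPs. It is an assumption-laden example rather than a description of the dashboard: the PoP labels, the counts, the `nxdomain_outliers` helper, and the default factor of 3 are all hypothetical.

```python
from statistics import median


def nxdomain_outliers(pop_counts, factor=3.0):
    """Return PoPs whose NXDOMAIN rate exceeds `factor` times the median rate.

    `pop_counts` maps a PoP label to a (nxdomain_responses, total_responses) pair.
    """
    # Skip PoPs with no traffic to avoid dividing by zero.
    rates = {
        pop: nxdomain / total
        for pop, (nxdomain, total) in pop_counts.items()
        if total > 0
    }
    if not rates:
        return []
    typical = median(rates.values())
    return sorted(pop for pop, rate in rates.items() if rate > factor * typical)


# Hypothetical per-PoP counts: (NXDOMAIN responses, total responses).
sample_counts = {
    "pop-a": (50, 1000),
    "pop-b": (40, 1000),
    "pop-c": (400, 1000),
    "pop-d": (60, 1000),
}

print(nxdomain_outliers(sample_counts))
```

Using the median keeps one noisy PoP from distorting the baseline it is compared against. Once a PoP is flagged, the next step is breaking its NXDOMAIN responses down by query name to find the record responsible.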

Resourcing and performance improvement

NS1’s global network requires constant care and feeding. To maintain and improve the performance of our systems, we’re always looking at underlying data patterns to determine areas that need more resources and work. Orb data streams are a key part of this ongoing effort.

Geographic data is among the most important information we use to plan resource allocations. When we see long-term traffic rising in specific regions, that often leads us to deploy more capacity in those areas - from bolstering our existing presence to standing up completely new PoPs.

NS1 also uses Orb data streams to analyze traffic routing patterns from back-end service partners such as internet service providers. Looking at the way DNS traffic comes into NS1 and is transmitted back to end users can help us improve baseline performance across the board.

As part of our constant drive to improve response times across our network, we frequently run development sprints to analyze data from specific PoPs and implement changes accordingly. This often involves combining Orb data with performance metrics from other platforms like Catchpoint, which measure latency. In a recent sprint for our Singapore PoP, the data pointed us toward changes that cut response times by an average of more than 20 ms per query for a subset of recursive resolvers in the region - a significant change.

Enhancing network observability

Here at NS1, we use the data and insights from Orb so often that it’s become a critical part of our operations. That’s why we’re such huge cheerleaders for Orb - it delivers so much value for us every day in terms of time saved, the efficiency of resource allocation, and system reliability. The best part: Orb is free to run, either as a self-hosted tool or as a cloud service. You can get started with Orb in under ten minutes - a no-brainer for any network team looking for actionable data.

And it gets even better! NS1 is now offering its customers the same Orb-backed visibility we’ve been leveraging internally. With the new DNS Insights feature of both Managed and Dedicated DNS, customers can comb through a curated data feed to quickly locate configuration errors that impact performance. Our customers are already starting to rave about this new feature - comments like “this is the data we’ve been wanting for years” are common. If you’re an NS1 customer, ask your customer success rep today for more information.