Classifying High Volume DNS Traffic In Near Real-Time

By Prachi Gupta, Vijay Annant Kushwaha, Harihara Natarajan

DNS servers are a crucial part of the DNS infrastructure. Within a telecom operator environment, DNS traffic data is a rich source of information that can improve the user experience, from streaming specific types of content based on device type to choosing faster servers for update locations. With the shift to working from home, there is also demand for newer types of services. For example: can you block my kids from watching Netflix from 9:00 am to 5:00 pm on weekdays? Such services can be enabled by real-time analysis of DNS traffic.

This article describes how we worked with one of our telecom customers, who has embarked on a journey to monetize DNS information, to design and test a system that can classify high volumes of DNS traffic in near real-time.

Fig 1: DNSTap for DNS Data capture

A telecom network operator came to us with a challenge that their network lab was finding difficult to address. They deal with peak traffic of 1.2 million DNS queries per second, and they were trying to classify this traffic into a set of known categories in near real-time.

In our analysis, the main issues in their approach seemed to be:

1. Lack of robust and real-time processing

Using wire tracing or log files to collect the information and derive insights for monetization led to delays and inevitable process issues. The tracing was a huge burden and required regular maintenance and tweaking.

2. Lack of a fast enough lookup mechanism for the categorization

The customer was trying to use the xxHash algorithm in C to achieve quick lookup times. With this approach they ran into various issues, such as handling elastic peak loads, which they were not able to fully resolve.

We hypothesized that the problem could be addressed with a different technology and architecture stack.

1. Since C as a language lacks many modern benefits, we looked for other options. Many bugs in modern systems like Chrome surface because of memory-unsafe operations in underlying code written in C. C also does not offer convenient libraries for efficient data structures. We had earlier chanced upon a video of an ideal fast hashmap implementation (https://youtu.be/ncHmEUmJZf4) and found confirmation in the Rust documentation that its HashMap is based on the same design. We chose Rust for the following reasons:
  • A better systems language than C at almost the same performance. Rust was created to provide high performance, comparable to C and C++, with a strong emphasis on code safety, and its compiler enforces memory safety.
  • Easy to deploy: only the binary needs to be deployed (unlike Java).
  • Easier concurrency, thanks to the data ownership model that prevents data races.
  • Zero-cost abstractions.
  • The default HashMap collection in Rust provides an easy way to choose the hashing algorithm.

2. We wanted to evaluate the DNSTap mechanism, which has the nice property of exposing the data as a Protobuf structure (enabling code generation and compilation). DNSTap is built into the DNS server (see Fig. 1). It captures DNS packets without the overhead of wire tracing or log files, and it prioritizes processing the packet over logging it.

3. We conjectured that a stateless architecture would enable horizontal scaling, and thus designed a containerized cloud-native application that can be scaled elastically in a public cloud using a Kubernetes-based solution such as Azure Kubernetes Service (AKS).

4. Instead of running a heavy database in Azure, we tested a hosted Snowflake column store and found it to be much better for analytics and query processing.
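
To illustrate the last bullet of item 1, here is a minimal sketch of a domain-to-category lookup table built on Rust's standard HashMap with a swapped-in hasher. The FNV-1a hasher is only a stand-in so that the example needs nothing beyond the standard library; it is not the hash used in the project. The names `Fnv1a`, `CategoryTable`, `build_table`, and `classify`, as well as the sample categories, are hypothetical.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// FNV-1a (64-bit), used here only as a std-only stand-in for a faster
/// hash such as xxHash. Assumption: any type implementing `Hasher` can be
/// plugged in the same way.
pub struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Hypothetical alias: domain -> category, hashed with the custom hasher.
pub type CategoryTable = HashMap<String, &'static str, BuildHasherDefault<Fnv1a>>;

/// Builds a tiny illustrative table; real feeds would be loaded from a file.
pub fn build_table() -> CategoryTable {
    let mut table = CategoryTable::default();
    table.insert("netflix.com".to_string(), "streaming");
    table.insert("windowsupdate.com".to_string(), "software-update");
    table
}

/// Exact-match lookup; returns None for domains that are not in the table.
pub fn classify(table: &CategoryTable, domain: &str) -> Option<&'static str> {
    table.get(domain).copied()
}
```

Because the hasher is a type parameter of the map, changing the hashing algorithm means changing the `CategoryTable` alias only; the lookup code stays the same.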

There were a few other similar ideas which we also tried out in this process.
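
To make the DNSTap idea in item 2 a little more concrete: once the Protobuf envelope has been decoded, a classifier still has to pull the queried name out of the raw DNS message. The sketch below does that for a wire-format query, following the standard DNS message layout (a 12-byte header followed by length-prefixed labels). It assumes the Protobuf decoding has already happened elsewhere, and the function name `query_name` is hypothetical.

```rust
/// Extracts the first queried name (QNAME) from a wire-format DNS message.
/// Returns None for truncated input, an empty question section, or a name
/// that uses compression pointers (not expected in a question).
pub fn query_name(msg: &[u8]) -> Option<String> {
    const HEADER_LEN: usize = 12; // fixed DNS header size
    if msg.len() < HEADER_LEN {
        return None;
    }
    // Bytes 4..6 of the header hold QDCOUNT, the number of questions.
    if u16::from_be_bytes([msg[4], msg[5]]) == 0 {
        return None;
    }

    let mut pos = HEADER_LEN;
    let mut labels: Vec<String> = Vec::new();
    loop {
        let len = *msg.get(pos)? as usize;
        pos += 1;
        if len == 0 {
            break; // the zero-length root label terminates the name
        }
        if len > 63 {
            return None; // compression pointer or invalid label length
        }
        let label = msg.get(pos..pos + len)?;
        // DNS names are case-insensitive, so normalise before lookup.
        labels.push(String::from_utf8_lossy(label).to_ascii_lowercase());
        pos += len;
    }
    Some(labels.join("."))
}
```

The returned string is lower-cased so that it can be used directly as a key in a category lookup table.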

With the right team, it was possible to quickly put each idea to the test, see the results, brainstorm, and quickly move forward. This agile process resulted in an experiment design and execution based on concepts of reusability, scalability, security, and extensibility to systematically identify what components would help the processing of a million DNS records every second.

Over a period of 7 weeks, we built out the experiment and were finally successful using a combination of the ideas above. The last obstacle was the need to generate traffic of a million packets per second. We solved the puzzle by recording DNS traffic using Linux tools and then replaying it from multiple threads (think time-compressed photography).
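
The replay idea can be sketched roughly as follows. This is not the tool we used; it is a minimal illustration that assumes the recorded DNS payloads have already been extracted from the capture into memory, and that they are re-sent over UDP. The function name `replay` and its parameters are hypothetical.

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::sync::Arc;
use std::thread;

/// Replays recorded DNS payloads to `target` from `threads` sender threads,
/// each looping over the whole recording `rounds` times.
/// Returns the total number of datagrams handed to the OS for sending.
pub fn replay(
    recorded: Arc<Vec<Vec<u8>>>,
    target: SocketAddr,
    threads: usize,
    rounds: usize,
) -> io::Result<u64> {
    // Pick a wildcard bind address in the same address family as the target.
    let bind_addr = if target.is_ipv4() { "0.0.0.0:0" } else { "[::]:0" };

    let mut handles = Vec::with_capacity(threads);
    for _ in 0..threads {
        let recorded = Arc::clone(&recorded);
        handles.push(thread::spawn(move || -> io::Result<u64> {
            // Each sender owns its socket, bound to an ephemeral port.
            let socket = UdpSocket::bind(bind_addr)?;
            let mut sent = 0u64;
            for _ in 0..rounds {
                for payload in recorded.iter() {
                    socket.send_to(payload, target)?;
                    sent += 1;
                }
            }
            Ok(sent)
        }));
    }

    let mut total = 0u64;
    for handle in handles {
        total += handle.join().expect("sender thread panicked")?;
    }
    Ok(total)
}
```

The recording is shared read-only through an `Arc`, so adding sender threads multiplies the offered load without copying the captured data.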

Once we had achieved the milestone of generating the traffic and processing the packets fast enough on a single VM, we looked at the additional requirements needed to complete the solution. We were able to accommodate all of them, primarily because the experiment design was flexible enough for us to add components quickly.

Since a rollout of this size would always be incremental and phased, there was a need to showcase the output in some form. We created a portal using React, Node.js, and Snowflake DB queries. This portal enabled select customers to log in and see the pattern of their traffic (top sites visited, blocked sites, etc.).

One of the operational problems was the governance of the DNS zones and their version control, using the Response Policy Zone (RPZ) concept of DNS. We demonstrated that using the open APIs of GitHub with private repositories offers a quick way to test rollbacks and to track changes across versions of the RPZ file.

GitHub was also the home for the “classification” feeds. This enabled quick wins such as version control and programmatic access.
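
As an illustration of why a version-controlled RPZ file is easy to diff and roll back, the sketch below renders the block records of an RPZ zone from a list of domains. It relies on the RPZ convention that a `CNAME .` record means "answer NXDOMAIN". The function name `render_rpz_block_records` is hypothetical, the SOA and NS records of the zone are omitted, and the actual layout of the customer's RPZ file is not shown here.

```rust
/// Renders RPZ records that answer NXDOMAIN for each blocked domain and
/// for every name beneath it. Output is sorted and de-duplicated so that
/// successive versions of the file produce small, stable diffs.
pub fn render_rpz_block_records(blocked: &[&str]) -> String {
    let mut names: Vec<String> = blocked
        .iter()
        .map(|d| d.trim().trim_end_matches('.').to_ascii_lowercase())
        .filter(|d| !d.is_empty())
        .collect();
    names.sort();
    names.dedup();

    let mut out = String::new();
    for name in &names {
        // "CNAME ." is the RPZ policy action for NXDOMAIN.
        out.push_str(&format!("{} CNAME .\n", name));
        // The wildcard entry covers all subdomains of the blocked name.
        out.push_str(&format!("*.{} CNAME .\n", name));
    }
    out
}
```

Owner names are written without a trailing dot, so they are interpreted relative to the origin of the RPZ zone that contains them.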

We also demonstrated the ability of the system to process DNS over HTTPS (DoH), the next step in the evolution of DNS beyond plain UDP on port 53.

Fig 2: Deployment architecture

The resulting deployment diagram is shown in Fig. 2. We are currently proceeding with the implementation of this architecture in production.

To summarize we were able to demonstrate:

  • A scalable Cloud-Native stateless design and Architecture
  • One instance handling 400K packets/second
  • Benefits of building on public SaaS services like Azure and Snowflake in terms of a reduced time to market
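
The stateless property in the first point can be sketched as follows: each worker holds only a read-only classification table and keeps nothing between batches, so any worker (or any additional replica) can process any batch. This is an illustrative sketch using threads inside one process, not the deployed service; `classify_batch` and `run_workers` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Classifies one batch of domains against a read-only table.
/// No state survives between calls, so batches can go to any worker.
pub fn classify_batch(
    table: &HashMap<String, String>,
    batch: &[String],
) -> Vec<(String, String)> {
    batch
        .iter()
        .map(|domain| {
            let category = table
                .get(domain)
                .cloned()
                .unwrap_or_else(|| "unknown".to_string());
            (domain.clone(), category)
        })
        .collect()
}

/// Distributes batches round-robin over `workers` threads that share only
/// the immutable table. Results come back grouped by worker, not in input order.
pub fn run_workers(
    table: Arc<HashMap<String, String>>,
    batches: Vec<Vec<String>>,
    workers: usize,
) -> Vec<(String, String)> {
    let workers = workers.max(1);
    let mut shards: Vec<Vec<Vec<String>>> = vec![Vec::new(); workers];
    for (i, batch) in batches.into_iter().enumerate() {
        shards[i % workers].push(batch);
    }

    let handles: Vec<_> = shards
        .into_iter()
        .map(|shard| {
            let table = Arc::clone(&table);
            thread::spawn(move || {
                shard
                    .iter()
                    .flat_map(|batch| classify_batch(&table, batch))
                    .collect::<Vec<_>>()
            })
        })
        .collect();

    handles
        .into_iter()
        .flat_map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

Since the workers never write to shared data, raising the `workers` count (or, in a cluster, the replica count) needs no coordination beyond distributing the input.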

We also realized in the process that we could use the eBPF (extended Berkeley Packet Filter) mechanism of the modern Linux kernel to replace DNSTap, and thus offer the same benefits to customers who are not willing to update their DNS servers to include DNSTap.
