A DNS Performance Incident

At SparkPost, we’re building an email delivery service with high performance, scalability, and reliability. We’ve made those qualities key design objectives, and they’re core to how we engineer and operate our service. In fact, we literally guarantee our service level and burst rates for our Enterprise service level customers.

However, we sometimes encounter technical limitations or operating conditions that have a negative impact on our performance. We recently experienced a challenging situation like this. On May 24, problems with our DNS infrastructure’s interaction with AWS’ network stack resulted in errors, delays, and slow system performance for some of our customers.

When events like this happen, we do everything we can to make things right. We also commit to our customers to be open and transparent about what happened and what we learn.

In this post, I’ll discuss what happened, and what we learned from that incident. But I’d like to begin by saying we accept responsibility for the problem and its impact on our customers.

We know our customers depend on reliable email delivery to support their business, security, and operational needs. We take it seriously when we don’t deliver the level of service our customers expect. I’m very sorry for that, as is our entire team.

Extreme DNS Usage on AWS Network Hits a Limit

Why did this slowdown happen? Our team quickly realized that routine DNS queries from our service were not being answered at a reasonable rate. We traced the issue to the DNS infrastructure we operate on the Amazon Web Services (AWS) platform. Initially we attempted to address query performance by increasing DNS server capacity by 500%, but that did not resolve the situation, and we continued to experience an unexplained and severe throttling. We then repointed DNS services for the vast majority of our customers at local nameservers in each AWS network segment, which were not experiencing performance issues. This is not the AWS-recommended long-term approach for our DNS volume, but we coordinated it with AWS as an interim measure that allowed us to restore service fully for all customers about five hours after the incident began.

I’ve written before about how critical DNS infrastructure is to email delivery, and the ways in which DNS issues can expose bugs or unexpected limits in cloud networking and hosting. In short, email makes extraordinarily heavy use of DNS, and SparkPost makes more use of DNS than nearly any other AWS customer. As such, DNS has an outsized impact on the overall performance of our service.

In this case, the root cause of the degraded DNS performance was another undocumented, practical limit in the AWS network stack. The limit was triggered under a specific set of circumstances, of which actual traffic load was only one component.

Limits like these are to be expected in any network infrastructure. But one area where the cloud provides unique challenges is troubleshooting them when the network stack is itself an abstraction, and the traffic interactions are much more of a shared responsibility than they would be in a traditional environment.

Diagnosing this problem during the incident was difficult for us and the AWS support team alike. It ultimately required several days of effort and assistance from the AWS engineering team after the fact to recreate the issue in a test environment and then identify the root cause.

Working with the AWS Team

Technology stacks aside, we know how much our customers benefit from the expertise of our technical and service teams who understand email at scale inside and out. We actually mean it when we say, “our technology makes it possible—our people make the difference.”

That’s also been true working with Amazon. The AWS team has been essential throughout the process of identifying and resolving the DNS performance problem that affected our service last week. SparkPost’s site reliability engineering team worked closely with our AWS counterparts until we clearly understood the root cause.

Here are some of the things we’ve learned about working together on this kind of problem solving:

  • Your AWS technical account manager is your ally. Take advantage of your account team. They’re advocates and guides to navigate AWS internal resources and services. They can reach out to product managers and internal engineering resources within AWS. They can hunt down internal information not readily available in online docs. And they really understand how urgent issues like the one we encountered can be to business operations.  If a support ticket or other issue is not getting the attention it deserves don’t hesitate to push harder.
  • Educate AWS on your unique use cases. Ensure that the AWS account team—especially your TAM team and solution architect—are involved in as much of your daily workflow as possible. This way, they can learn your needs first hand and represent them inside of AWS. That’s a really important part of keeping the number of unexpected surprises to a minimum.
  • Run systematic tests and generate data to help AWS troubleshoot. The Amazon team is going to investigate the situation on their end, and of course they have great tools and visibility at the platform layer to do that. But they can’t replicate your setup, especially when you’ve built highly specialized and complex services like ours. Running systematic tests and providing the data to the AWS team will provide them with invaluable information that can help to isolate an unknown problem to a particular element of the platform infrastructure. And they can monitor things on their end during these tests to gain additional insight into the issue.
  • Make it easy for engineers on both teams to collaborate directly. Though your account team is critical, they also know when letting AWS’ engineers and your engineers work together directly will save time and improve communication. It’s to your advantage to make that as easy as possible. Inviting the AWS team into a shared Slack channel, for example, is a great way to work together in real-time—and to document the interactions to help further troubleshooting and reproduce context in the future. Make use of other collaboration tools such as Google docs for sharing findings and plans. Bring the AWS team onto your operations bridge line during incidents and use conference calls for regular check-ins following an incident.
  • Understand that you’re in it together. AWS is a great technical stack for building cloud-native services. But one of the things we’ve come to appreciate about Amazon is how openly they work through hard problems when a specialized service like SparkPost pushes the AWS infrastructure into some edge cases. Their team has supported us in understanding root causes, developing solutions, and ultimately taking their learnings back to help AWS itself continue to evolve.

The AWS network and platform is a key part of SparkPost’s cloud architecture. We’ve developed some great knowledge about leveraging AWS from a technical perspective. We’ve also come to realize how important support from the AWS team can be when working to resolve issues in the infrastructure when they do arise.

Looking Ahead

In the coming weeks, we will write more in detail about the DNS architecture changes we are currently rolling out. They’re an important step towards increasing the resilience of our infrastructure.

Whether you’re building for the AWS network yourself, or a SparkPost customer who relies on our cloud infrastructure, I hope this explanation of what we’ve learned has been helpful. And of course, please reach out to me or any of the SparkPost team if you’d like to discuss last week’s incident.

—Chris McFadden
VP Engineering and Cloud Operations