DNS, ELBs, ANAMEs, and TTLs, Oh My!
We had an issue this morning with Dead Man’s Snitch. We use Pingdom (among other tools) for basic uptime monitoring, and it started reporting intermittent downtime from certain regions of the world, which we couldn’t reproduce. Mistake #1: we assumed this was a network connectivity error, or something on Pingdom’s side, as everything looked good from our end.
We got one report from a user (thanks Mike!) that simply asked:
Why does deadmanssnitch.com redirect to here.com right now?
Our first stop whenever there’s a DNS issue is to look at OpenDNS’s fantastic Cache Check tool to see what records are being reported around the world.
Although it is correct in this picture, when we first checked, it listed completely different IP addresses in a handful of places around the world. Uh oh, that’s not good! At the bottom of that page, there’s a button to “Refresh the Cache”. This is fantastic when you have a DNS problem, so we immediately clicked it to get every location reporting the same IPs.
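If you would rather script that comparison than eyeball it, the check boils down to grouping resolvers by the set of IPs they return and flagging the odd ones out. Here is a minimal Python sketch. The function name, the locations, and the addresses are all made up for illustration (the IPs come from the reserved documentation ranges), and it assumes you have already collected the answers from each resolver somehow.

```python
from collections import Counter


def find_outlier_resolvers(answers):
    """Return resolvers whose IP set differs from the most common answer.

    `answers` maps a resolver location to the set of IPs it returned.
    """
    if not answers:
        return {}
    # Freeze each answer so identical sets can be counted together.
    frozen = {location: frozenset(ips) for location, ips in answers.items()}
    most_common_ips, _count = Counter(frozen.values()).most_common(1)[0]
    return {
        location: sorted(ips)
        for location, ips in frozen.items()
        if ips != most_common_ips
    }


# Hypothetical data, not the real records from the incident.
set_a = {"192.0.2.10", "192.0.2.11", "192.0.2.12"}
set_b = {"198.51.100.20", "198.51.100.21", "198.51.100.22"}
answers = {
    "new-york": set_a,
    "london": set_a,
    "frankfurt": set_a,
    "sydney": set_b,
}

outliers = find_outlier_resolvers(answers)
print(outliers)
```

Note that the most common answer is not necessarily the correct one. The sketch only tells you that resolvers disagree, which is the signal worth investigating.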
Now we had to figure out how it got that way.
At that point, we jumped back into Pingdom and looked deeper into the errors.
Because we use the “naked domain” deadmanssnitch.com (which I typically would recommend against, but that’s another post), we have to do some magic to get to the Elastic Load Balancers (ELBs) that actually power the application. For that, we use an ANAME record at DNS Made Easy. This allows us to point the naked domain at elb035092-1670947580.us-east-1.elb.amazonaws.com, which points to 3 IP addresses.
Using an ANAME is a bit of a hack to get around the fact that Heroku and ELB don’t have static IPs we can point to. It effectively goes up the DNS chain to find out what IPs are in use, and then reports them when someone does a DNS lookup for deadmanssnitch.com. With a CNAME you can set the Time to Live (TTL) to be nice and long. Mistake #2: we set a long TTL for our ANAME as well.
Sometime last night, the IP addresses changed for our ELBs to a new set of three. This is common, and not usually a problem. However, because our ANAME was caching the IPs for too long, it wasn’t seeing the change right away.
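To make the failure mode concrete, here is a toy Python model of a DNS provider that resolves an ANAME target and caches the result for the TTL. This is only a sketch of the behavior described above, not how DNS Made Easy actually implements it. The class name, the 6-hour TTL, and the addresses (taken from the reserved documentation ranges) are all assumptions for illustration.

```python
class AnameCache:
    """Toy model of a provider that resolves a target and caches the IPs."""

    def __init__(self, resolve_target, ttl_seconds):
        self._resolve_target = resolve_target  # callable returning current IPs
        self._ttl = ttl_seconds
        self._cached_ips = None
        self._fetched_at = None

    def lookup(self, now):
        """Return the IPs served at time `now` (seconds since start)."""
        expired = (
            self._fetched_at is None
            or now - self._fetched_at >= self._ttl
        )
        if expired:
            # Only here does the provider look at the real target again.
            self._cached_ips = list(self._resolve_target())
            self._fetched_at = now
        return self._cached_ips


HOUR = 3600

# Hypothetical ELB state; the addresses are placeholders.
elb = {"ips": ["192.0.2.10", "192.0.2.11", "192.0.2.12"]}


def resolve_elb():
    return elb["ips"]


cache = AnameCache(resolve_elb, ttl_seconds=6 * HOUR)  # assumed long TTL

before = cache.lookup(now=0)

# The ELB moves to a new set of three addresses.
elb["ips"] = ["198.51.100.20", "198.51.100.21", "198.51.100.22"]

two_hours_later = cache.lookup(now=2 * HOUR)  # still the old, cached IPs
after_expiry = cache.lookup(now=6 * HOUR)     # cache finally refreshed

print(two_hours_later)
print(after_expiry)
```

Between the change and the cache expiry, every lookup gets the old addresses, which is the window we were stuck in.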
Our solution was to change the TTL for our ANAME to under 1 hour. Amazon’s Elastic IPs (which ELB uses) will fortunately route traffic to the correct server for at least 1 hour after a change. Our TTL was originally much longer than that; now it is 30 minutes.
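The arithmetic behind that choice is simple enough to write down. The Python sketch below computes the worst-case stretch during which the ANAME could still be handing out IPs that no longer route correctly. The 1-hour grace period and the 30-minute TTL come from this post; the 6-hour "old" TTL is a hypothetical stand-in, and the model ignores any extra caching by downstream resolvers.

```python
def stale_exposure_seconds(aname_ttl, grace_period):
    """Worst-case seconds of serving IPs that have stopped routing correctly.

    Assumes the ANAME cached the old IPs just before they changed, so it
    keeps serving them for up to `aname_ttl` seconds, while the old IPs
    keep working for `grace_period` seconds.
    """
    return max(0, aname_ttl - grace_period)


MINUTE = 60
HOUR = 60 * MINUTE

grace = 1 * HOUR        # grace period described in the post
old_ttl = 6 * HOUR      # hypothetical; the post only says "much longer"
new_ttl = 30 * MINUTE   # the TTL we settled on

print(stale_exposure_seconds(old_ttl, grace))  # 18000 seconds, i.e. 5 hours
print(stale_exposure_seconds(new_ttl, grace))  # 0
```

Any TTL at or below the grace period brings the exposure to zero in this simplified model, which is why staying under 1 hour is the important part.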
The moral of the story? Don’t make assumptions with TTLs, and always read error messages. You may be surprised at what’s actually happening!