DevOps Horror Stories
Stories From Beyond The Grave For Your Halloween Enjoyment

John
Wherever We Go, So Go The Networking Problems
A Terrifying Tale By John Allison, CTO, Customer.io

Customer.io is running some heavy tech for one guy to manage, and you guys have moved data centers a few times. Can you tell me more?

Customer.io has lived in 3 data centers so far, each with catastrophic network issues of its own. We can't tell if the servers we have bring the plague with them, or if it's just coincidence, but it sure is eerie. All of these providers have otherwise stellar reputations, but for some reason we seem to break them down regardless. Each time we move we think we're done moving, and soon enough something causes us to pick up the whole stack and move continents again.

Where did you start, and what was the first last-straw problem you experienced?

We started out with our whole stack in Linode's New Jersey data center. It worked fine for a while, until the whole data center started seeing networking flickers, sometimes as often as every couple of hours, where we were in New York City and couldn't even SSH into our own machines. If this happens once it's sometimes forgivable, but it persisted for weeks and the technical support people weren't putting us at ease about the root cause or when it would be fixed.

That's terribly unsettling, and I'd likely move too if that was the case. What was your next attempt?

Our next attempt was with Hetzner in Germany. Their price-to-machine-size ratio was great, and we got beefier machines than we could get with Linode, so we were initially really excited. After everything got moved over, a couple of months went by and we started noticing that every 10 seconds or so the latency between the nodes - nodes that were allegedly on the same rack in their DC - would spike from around 0.75ms to 2000ms.

Unfortunately, the technical support at these orgs is almost uniformly unhelpful, and the conversation follows the same pattern. You describe the problem in a technical way, and most of the time you get a generalized response saying “please prove that there is a problem on our side and not your side”. They give you a few commands to run, you paste the output into the email, a couple of days go by and you're still losing sleep, etc.

In this case, they responded to our network traces and the predictability of the issue with something like "we focus on the stability of our network, not latency". It was impossible to convince them that a stable network should be predictable and not have 2000ms latency spikes; they unfortunately weren't much help. It was time to pack up and move again.

Okay, so 0 for 2 attempts. Third time's a charm?

We're hoping that's the case. Our current home is with OVH in Canada. Things have been going well so far - we did the move about 3 months ago - but just a couple of weeks ago 2 of the fiber connections into the DC snapped under a bridge. I believe it was maintenance workers who accidentally cut into them, but all that was left was an emergency backup network route that obviously became 100% saturated as most or all of the DC traffic was running over that single line.

So you run away from the network gremlins, and they show up outside the DC? What is the customer response like during something like this?

Ya, all we could do was laugh - third time definitely not a charm. Throughout most of our issues, historically, our customers have been very understanding. We tell people exactly what’s happened, keep updating during the incident, and provide a solid postmortem - if they love the product and love what you’re doing they’ll be understanding. Just be human about it affecting their business; the last thing we want to do is be the corporate person who doesn’t understand them.

Any other notes or ways you've helped make the system more stable?

At some point we moved all of the event collectors onto EC2 to run round-robin on both the east and west coasts. Those have been incredibly stable, and it lets us aggregate events off of the main stack for even more stable collection. We can pause the event collectors from forwarding to the main stack on a site-wide or account-specific basis, which lets us do extended maintenance or gracefully handle the inevitable future network issues we're counting on having :)

The only other curse-of-the-network issue we've had so far is when the .io TLD went completely down due to the registrar servers not responding to whois lookups. These are the types of things you can't anticipate, and you begin scratching your head over how to even plan for something like this in the future.

So after all of this headache, how has it shaped your thinking about how you as a company view downtime and the support around it?

In general, we try to align our downtime with our customers' downtime. At the end of the day, you have to pick a final failure point, and it's best if that can align with your customers' failure points as well. Namely, we're relying heavily on Amazon's ELB and the .com registrar to be up and functioning at all times, and we like this tradeoff because if the .com registrar or ELB is down, the whole internet is pretty much coming down with it. If our customers' sites are down, their email deliverability and event triggering is seemingly the last of their worries, so we're betting on the same infrastructure they are to align our incentives and lessen the impact of us having downtime issues.

On the human side of things, we have great rapport with our customers through our support efforts each and every day. This functions much like an insurance policy when we have unscheduled downtime or emergency maintenance. On the whole, a great strategy for lessening the impact of downtime is to make the uptime situation the best it can possibly be for the customer.

Josh
Drunkenly Monkey-Patching The Paperclip Gem
A Mysterious Memoir By Josh Dzielak, VP of Engineering, Keen IO

So Keen.io just had a recent outage, but you're insisting that we talk about another story. I hope it's good.

Oh, this one's a doozy. I hope I don't have a story that beats this one, because it would be tough to imagine how horrific and obscure it would be.

Touché. Well, let's just dive right in. Set the stage for everyone.

Right, so my previous company was called Togetherville - it was a social network for families, and I was a founding engineer. One of the most popular features was a card builder applet where kids 6-10 years old could draw cards and then share them with their friends, families, neighbors, etc., in their network. You can imagine the times of highest traffic for this feature were around the holidays, because that's when kids would sit down and make the most cards.

I'm laughing at where this is headed already.

Yep, we're going there. So it's Christmas Eve at about 10pm. I'm more than a few drinks in and I get a phone call from my CEO who says his kid (he had 3) is trying to create a card for the family, and he keeps clicking the save button and it keeps failing! Not only was it not saving, but after 30 minutes of him building the card it would just spit out this cryptic error message and refresh the page. All the work was lost each time, and these kids were starting to cry out of frustration!

So I hopped on to make sure all these kids weren't just bad at using the site, and confirmed in the Rails logs that it was in fact a 500 error getting spit back by the card save endpoint. We store all the items in S3 using the Paperclip gem, like everyone else, and it was just failing with a very generic error along the lines of "could not save object".

Was your credit card expired? This smells a lot like the Azure SSL certificate expiry from a couple of years back.

That was my initial inclination, but after checking to make sure the bill had been paid and the keys were correct, I had to rule that out. So I got to googling and, by some miracle, happened to come across a post that said "you can't have more than 64,000 objects in a folder." A light bulb turned on, and I realized the way we were using Paperclip meant we could be hitting this limit.

Sure enough, it turned out to be a Linux inode limitation on a secondary production server we used for uploading to S3. And we hit it. 10pm on December 24th. Kids were crying. So I did some more searching, and sure enough other Paperclip users were reporting the same exact issue. We had found our culprit.

Was a fix already out? This sounds like one of those cases where the gem is 2 years old and you have to do it live in production to get it patched up.

That's exactly right. A fix was already out on master that would interpolate the model name, the style of the image, etc., into the path, which would fix the folder issue. But Paperclip uses that formula - the interpolation I just mentioned - to generate the S3 path, so just pushing the new gem would have hosed our ability to retrieve the previously uploaded cards since the paths were gonna be off. At one point I was about to just give up and concede that all previous cards were lost if we wanted to fix this issue at all for the remainder of Christmas Eve, New Year's, etc.

In the end, I couldn't ruin Christmas by hosing a bunch of previous cards. So there I sat, half drunk, 2 hours before Santa was set to arrive, monkey-patching the Paperclip gem to key off of a created_at field, instructing it to use the old interpolation if the card was created before X time, and the default interpolation if the card was created after X time. One of the dirtiest pieces of code I've written to date.

I saved the children.
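
The original patch is long gone, but here's a rough sketch of the idea in Paperclip terms (assuming a Rails app). Everything concrete below is a hypothetical stand-in - the Card model, the :era_aware_path interpolation name, the cutoff constant filling in for the unspecified "X time", and both path layouts - but it shows the shape of keying the path off created_at.

```ruby
# A rough sketch of the approach, not the original patch. The Card model,
# the :era_aware_path name, the cutoff constant (standing in for the
# unspecified "X time"), and both path layouts are hypothetical.
require "paperclip"

LEGACY_PATH_CUTOFF = Time.utc(2010, 12, 25) # placeholder for "X time"

# Custom interpolation: pick the S3 path layout based on when the record
# was created.
Paperclip.interpolates :era_aware_path do |attachment, _style|
  record = attachment.instance
  if record.created_at && record.created_at < LEGACY_PATH_CUTOFF
    # Old layout: every card in one flat "folder" - the one that hit the
    # ~64,000 objects limit.
    "cards/#{record.id}"
  else
    # New layout: partitioned ids (e.g. cards/000/012/345), mirroring
    # Paperclip's :id_partition interpolation.
    "cards/#{format('%09d', record.id).scan(/\d{3}/).join('/')}"
  end
end

class Card < ActiveRecord::Base
  # S3 bucket and credential options elided.
  has_attached_file :image,
    storage: :s3,
    path: ":era_aware_path/:style/:filename"
end
```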

Okay, so the fix is pushed. You go right to bed. Of course it works through the night =D. Did you put off the mountain of email until the morning? And where were the other devs?

Ya, worked fine through the night. With a non-technical crowd, any communication short of "it's fixed, try again" just isn't acceptable, so we waited until the morning when we could use a spray-and-pray "it's fixed" type of autoresponder.

The other devs on the team didn’t have any experience with Paperclip either, and would have been just as lost trying to diagnose it or help out. You know, with problems like these it's often counterproductive to get another set of hands on deck when you're at the point of monkey-patching, because it may take an hour just to set the stage for them to understand the context and magnitude of the problem. I just stayed heads down and got it fixed all in one state of flow.

Is the patch still around?

Doing the migration would have involved downloading and re-uploading hundreds of thousands of cards, and nobody was interested in doing that. So we just left it in.

Scott
We Cut The Network Interfaces. Both Of Them.
A Spine-Tingling Story By Our Very Own Scott Klein

What was the first sign of the slowdown?

This was at a previous job. It was late on a Friday afternoon, maybe 4pm ET, and the site slowly started grinding to a crawl. It wasn't immediately clear what the cause was until the Rails processes started throwing connection errors as they waited too long to connect to the database. In some cases they were turned away by MySQL with a "too many connections" type of error.

Had you added any capacity recently? What triggered it?

Seemingly nothing, unfortunately. We later found that one of the background job machines was leaking db connections, and that was the cause. There was a saving grace, though: MySQL reserves an extra connection beyond the total pool for the superuser, so you can always get to the CLI.

We tried logging in with the superuser role, but it kept getting rejected. Even after checking all the nodes to see if something had a screen session open from months ago, it still wasn't allowing connections.
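
The interview doesn't say exactly how the background job machine was leaking connections, but a classic version of this failure in a Rails-era stack is a worker that checks ActiveRecord connections out in ad-hoc threads and never gives them back. Here's a hedged sketch of the leaky pattern and the usual fix; the connection settings, worker names, and queries are all hypothetical.

```ruby
# Hypothetical illustration only - not the actual code from this incident.
require "active_record"

ActiveRecord::Base.establish_connection(
  adapter:  "mysql2",
  database: "app_production", # placeholder connection settings
  pool:     5
)

# Leaky pattern: each ad-hoc thread implicitly checks a connection out of
# the ActiveRecord pool and never explicitly returns it, so the pool - and
# eventually MySQL's max_connections - slowly drains.
def leaky_worker(jobs)
  jobs.map { |job|
    Thread.new do
      ActiveRecord::Base.connection.execute("SELECT 1 /* #{job} */")
    end
  }.each(&:join)
end

# Safer pattern: with_connection checks the connection back in when the
# block exits, so a long-lived worker doesn't starve the pool.
def well_behaved_worker(jobs)
  jobs.map { |job|
    Thread.new do
      ActiveRecord::Base.connection_pool.with_connection do |conn|
        conn.execute("SELECT 1 /* #{job} */")
      end
    end
  }.each(&:join)
end
```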

What options did you have at that point? It sounds like you're pretty stuck.

At that point, ya, pretty stuck. The next step was to try to restart the process. The database would come up cold, but that was better than processes failing. We initiated the restart and it just hung. "Service MySQL restarting..."

We waited for minutes. The site was still up, still throwing errors, and we had no clue what the restart was doing. The mysqld process was at normal CPU levels, so we weren't sure if it was working on the shutdown. We didn't want to just kill the process in case any tables were still open - that sounded like a data loss nightmare.

So the last ditch effort was stuck. Now what? kill -9?

We tried to reason about why the restart was failing. Our guess was that the restart process was waiting for a connection to get into the MySQL CLI to initiate the shutdown, and just hanging since there were no connections left in the pool.

At some point, I suggested that we bring the network interfaces down. This would signal the kernel to release those connections, which would bubble up through mysqld and open up some slots for the restart to actually proceed. Each node had 2 interfaces: 1 for intranet connections and 1 exposed to the public internet.

eth1 goes down, restart process still hanging

start typing...ifdown eth0

Right as the enter key is hit, this booming "WAIT!!" comes in from the background.

We had cut the network interface hosting the SSH session, and the internal interface had already been cut. No terminal access to the host remained.

So we called the host. Hard reboot.

Were you trembling? Couldn't all the data be gone? That must have been a long 2 minutes.

Longest 2 minutes, and completely silent for the first 30 seconds as everyone stared at each other. You then immediately start scheming to try and figure out how you can quickly assess the damage and the remediation plan assuming the worst has already happened.

Once the node came up we immediately went through all the tables, selecting some data to make sure nothing had been corrupted in the process.

Lucky for us, everything was fine. Unlucky for us, the database was completely cold, and it warmed up over the next 12 hours. Lucky for us, it was Friday at 5:30pm at this point. Far from peak traffic.

We played a good amount of mysqltop whack-a-mole for a while on any query that lasted longer than a couple of seconds. It certainly helped, but it wasn't a silver bullet, and eventually we just had to pack it up and head home.

Todd
The 3½ Day Backup
A Nightmarish Novel By Todd Mosier, Zone Five Software

So your role right now is much different than it used to be, what was your job in a former life?

Definitely. We're just a two-person team right now at Zone Five, but I used to work for a big multinational corporation that did tech stuff for hospitals - everything from selling software to running ops. I was on-site often, and you wouldn't believe some of the things I saw.

You've mentioned a drip pan story to me before - is this heading where I think it's heading?

Not quite, but you kinda get the picture. I was called in to a hospital that had just fired its whole C-suite, so there was only a temp CTO. One of the employees there had bought an IBM tape deck run by Tivoli Storage Manager, but didn't buy the arm to link the primary and extension decks. They called us in because the temp CTO wanted the permanent job and was watching everything like a hawk. These were Solaris boxes; I'd had about 2 hours of Solaris experience at university, and they somehow thought this was okay.

And yes, the server room had an AC unit with a drip pan for the condensation. It had to be dumped daily. Nothing to see here...carry on...

What was the new tape deck for?

HIPAA requires frequent backups, and this hospital hadn't taken one in months. Totally in violation of the regulation. Bad stuff. My first order of business was starting the backup manually, just to get something in place for DR, but because it was running against the live system I had to nurse it and play tricks around the doctors.

I'd run it at nice-20 during the day, then adjusted the niceness level according to the doctors' complaints of slowness. When they complained, I notched it up. When they changed shifts - 2am, I think - I notched it back down again. It took 3½ days to finish.

This sounds like a 1970s horror nerd flick. Hopefully you were out of there quick.

3 weeks. Not the 1970s, but close... 2007. I ended up with the poor tape operator crying because she still had to come in on Saturdays to change the tape backup for the metadata on the Solaris box. So I wrote her a DOS batch file for EMC NetWorker - 30 lines, almost all echos - and my company ended up selling it as part of our solution.

Making money for your employer, one echo at a time. Tell me your batch file lived on to make it into marketing material and a line item in contracts.

I really hope it did, but I left shortly thereafter so I guess I'll never know. I guarantee you that tape library is still there, and will be for 5+ years.

Sorry, did you say 3 weeks? How did you cope?

Luckily there was a beach nearby. And a Friendly's. How much ice cream can one person eat in 3 weeks? More than his fair share.


Have horror stories you'd like to feature for next year? Get in touch at hi@statuspage.io.