"Over-communication internally is key. If I know something is an issue and an engineer knows about it, it doesn't mean everyone knows...we need to make sure that all stakeholders are on the same page."
Senior Support Engineer, Mixpanel
Mixpanel’s two-time award-winning Support team is the heart of the company. So when the leading user analytics platform does have a problem to solve, its skilled incident responders know exactly how to resolve it quickly, together. Even though downtime is a rare occurrence (Mixpanel has an impressive track record of 99.98% uptime, which translates to a mere 17 seconds of downtime a day), they are always prepared and able to build trust with their customers when it matters most.
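For the curious, the 17-second figure follows from simple arithmetic on the uptime percentage. Here is a minimal sketch of that calculation; the function name `daily_downtime_seconds` is ours, purely for illustration, and it assumes downtime is averaged evenly across days.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def daily_downtime_seconds(uptime_percent: float) -> float:
    """Average downtime per day implied by an uptime percentage.

    Hypothetical helper for illustration; assumes downtime is spread
    evenly across days, which real incidents rarely are.
    """
    return (1 - uptime_percent / 100) * SECONDS_PER_DAY


# 99.98% uptime leaves 0.02% of each day, a little over 17 seconds.
print(round(daily_downtime_seconds(99.98), 2))
```

Over a 30-day month, the same rate adds up to a little under nine minutes.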
What makes them so good at incident response? Our bet's on the strong collaboration and trust formed between dev and support teams, their excellent processes and documentation, and a habit of over-communicating both internally and externally when things go wrong. We caught up with a few members of their support team and an engineer to learn more.
Unify dev & support
Traditionally, technical folks (developers, SREs, etc.) are the ones getting paged when something goes wrong. But at Mixpanel, the Support team leads the organization through incident response, too.
Support team members are on the on-call list right beside their engineering counterparts so they can start updating users as soon as an issue is detected. They work in lockstep during incident response so customers receive the most up-to-date and accurate information possible. Jira tickets, dedicated Slack channels, and Statuspage act as sources of incident truth that keep teams in sync during an incident.
Tools are only part of the equation during incident response. The Mixpanel team also has well-defined roles, a communication style guide, and incident communication templates down pat before an incident strikes so everyone is aligned when it matters most. They created the style guide in collaboration with their marketing team so they could quickly reference tips for tone, words to avoid, etc. while writing incident updates. One of their guiding communication principles is to be "honest, but not alarmist", aiming to be as transparent as possible, without ever giving users inaccurate or irrelevant information.
Ultimately, Mixpanel is able to provide legendary support not only with their solid technical skill set, but also with a deep level of empathy. By quickly identifying the root cause of someone’s question, Support Engineers are able to connect with customers and teach them how to make more informed decisions about their products and company, faster. By updating users early and honestly, they're able to clear up confusion and build lasting trust.
Over-communicate to stay in sync
Clear, comprehensive, and organized communication is the name of the game during incident response. "Over-communication internally is key," Will told us. "If I know something is an issue and an engineer knows about it, it doesn't mean everyone knows...we need to make sure that all stakeholders and all people communicating with customers are on the same page."
Mixpanel organizes communication during an incident by breaking out different types of conversations into different Slack channels and documenting which channels to use for what. Anyone can reference these documents and jump into the right chat at the right time. For example, they talk through incident fixes in their "Ops team" channel, but use "downtime chatter" for related convos not connected to the actual fix. Strong collaboration and communication internally help them deliver quick and consistent comms externally.