Post Mortem on Incidents - How to Manage Downtimes

Mario Schaefer
January 24, 2022
DevOps

Mistakes are human, and they can cause anything from minor to severe incidents. Let’s face it: we can try to prevent mistakes, but sooner or later one will happen. Making mistakes isn’t the biggest issue, though. What matters is that we learn from them. Holding a post mortem after an incident or error in your projects lets you do exactly that. In this post, we cover how to introduce a post mortem culture into your DevOps organization and why blaming someone doesn’t solve anything.

The best way to learn from incidents is to institute post mortems.

What is a post mortem in projects and incidents?

Incident post-mortems, or retrospectives, are evaluations of how an incident occurred, what impact it had on business metrics and goals, and how the team fixed the errors.

Why do you need post-mortem analysis in DevOps?

Many companies, big or small, experience major incidents at least several times a year. As mentioned, you can work to prevent incidents, reduce their impact, and limit their effect on your goals and other important KPIs. But they will still occur, no matter what.

Changes in your systems, code or infrastructure can introduce instabilities that cause incidents. As a DevOps champion, you probably release new iterations of code or updates at a high frequency. This reduces the risk that any individual release fails. The growing number of releases will most likely not reduce the total number of incidents, but it does drastically reduce the probability of an entire system going down.

But what happens when a critical incident occurs?

Rather than playing the blame game and pointing the finger at the person responsible, focus on the only thing that counts: finding what caused the error that led to the incident and, if necessary, mitigating the impact.

Analyzing the root cause and implementing preventive measures are important to make sure such incidents do not occur too often. Otherwise, incidents may start multiplying, and fixing errors becomes part of the weekly routine. Sooner or later, the only thing your teams will work on is incident response. And nobody wants that! To break out of this spiral, your team needs to acknowledge the new status quo.

And what does a postmortem analysis look like?

To learn from past mistakes, we must analyze the events in detail after each incident.

A postmortem should be initiated whenever an incident requires an engineer to respond. Normally, a postmortem analysis records the following objective evidence:

What triggered the incident?
What was the impact?
How long did it take to detect and mitigate the incident?
What steps were taken to mitigate the incident?
Has the team conducted a root cause analysis?
Can we develop a timeline of significant activity? Centralize key activities from chat conversations, incident details, and more.
What are the learnings and next steps? What went well? What didn’t go well? How do we prevent this issue from happening again?
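
The questions above map fairly naturally onto a small record type. The following is a minimal sketch in Python, not a prescribed format: the class names, field names and the sample incident are all assumptions made for illustration. It shows how detection and mitigation times can be derived from three timestamps, and how activity from several sources can be merged into one chronological timeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TimelineEvent:
    at: datetime
    source: str  # illustrative labels such as "chat" or "alerting"
    note: str


@dataclass
class Postmortem:
    trigger: str                      # What triggered the incident?
    impact: str                       # What was the impact?
    started_at: datetime
    detected_at: datetime
    mitigated_at: datetime
    mitigation_steps: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None  # None until a root cause analysis exists
    timeline: List[TimelineEvent] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    went_badly: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    def time_to_mitigate(self) -> timedelta:
        return self.mitigated_at - self.started_at

    def add_events(self, events: List[TimelineEvent]) -> None:
        # Merge events from any source and keep the timeline chronological.
        self.timeline = sorted([*self.timeline, *events], key=lambda e: e.at)


# Hypothetical incident, invented purely for this example.
example = Postmortem(
    trigger="Configuration change rolled out to the load balancer",
    impact="Checkout unavailable for all users",
    started_at=datetime(2022, 1, 10, 9, 0),
    detected_at=datetime(2022, 1, 10, 9, 12),
    mitigated_at=datetime(2022, 1, 10, 10, 30),
)
example.add_events([
    TimelineEvent(datetime(2022, 1, 10, 9, 30), "chat", "Rollback started"),
    TimelineEvent(datetime(2022, 1, 10, 9, 12), "alerting", "Error rate alert fired"),
])

print(example.time_to_detect())    # 0:12:00
print(example.time_to_mitigate())  # 1:30:00
print([e.note for e in example.timeline])
```

Keeping the root cause optional makes it easy to see at a glance which postmortems still lack a root cause analysis.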

In most cases, the analysis is conducted by the team members who responded to the incident and mitigated it or investigated the root cause.

Example: Facebook, WhatsApp and Instagram outage/downtime October 2021

On the 4th of October 2021, multiple public services of Facebook were down worldwide for roughly six hours. As Facebook stated, configuration changes on the backbone routers that coordinate the network traffic between data centres caused issues that interrupted communication.

Learn more about the service outage and postmortem at Meta in this article.

What Facebook did seems very straightforward at first glance: they followed the internal process established through their storm drills, so that whenever something happens, they know precisely what to do next.

Afterwards, they performed an incident postmortem to find out what really happened.

Introducing blameless post mortems

The process of conducting post mortems seems quite straightforward. Yet, for many organizations that have never thought about implementing post mortems into their incident response, this is a challenge that shouldn’t be taken lightly.

The introduction and ongoing success of any new or changed process require time and effort from all levels of the organization.

To make the transition easier, there are a few key principles to follow:

Make sure you stay away from blame games and finger-pointing: This is the most crucial aspect of getting things right out of the gate. If the analysis focuses on blaming the people who caused the incident instead of making sure the team learns and improves, the initiative will do more harm than good.
Communicate openly and in a mistake-friendly way: Make sure that post mortem meetings aren't held to find someone to blame. They are an opportunity for the teams to learn and improve. This means being honest about what happened and correcting expectations.
Introduce a post mortem leader: Having a dedicated lead will ensure that every incident response is finished with a postmortem. These leaders usually have a broad understanding of all services and of DevOps. The leader sets the tone for post mortems and largely determines the collective attitude of blamelessness.
Work together, share information and foster communication: Make sure that the postmortems are well documented in an internal platform (such as a Confluence wiki). Every post mortem can then be used as viable training material for your teams.
Bring top management on board and include all relevant stakeholders: To win over every team member in the organization, top-down communication is essential. At the same time, managers need to know what happened, so make sure to communicate clear numbers and insights.
Make decisions: Ideally, a good blameless postmortem will provide preventative suggestions. You need to identify who is responsible for approving recommendations and reviewing the written reports.
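
To make the last principle concrete, here is a minimal sketch in Python of how preventative suggestions could be tracked so that none is left without a responsible person. The `ActionItem` type, its fields and the sample items are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None     # who implements the recommendation
    approver: Optional[str] = None  # who approves it


def needs_decision(items: List[ActionItem]) -> List[ActionItem]:
    """Return the items that still lack an owner or an approver."""
    return [item for item in items if not item.owner or not item.approver]


# Hypothetical recommendations from a postmortem.
items = [
    ActionItem("Add a canary stage to config rollouts", owner="team-platform", approver="cto"),
    ActionItem("Alert on failed health checks", owner="team-sre"),
]

for item in needs_decision(items):
    print(f"Undecided: {item.description}")
```

A check like this could run at the end of each postmortem meeting, so that open recommendations are visible before the report is reviewed.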