Error analysis and troubleshooting in the production environment of distributed systems

Guide: Troubleshooting and fixing production issues in distributed systems

Distributed systems are the backbone of many modern software applications and platforms. They enable service scalability and availability by distributing workloads across multiple machines and geographic locations. However, as with any complex system, production problems can disrupt service and impact users.

As a DevOps or Infrastructure Engineer, it is important to have the skills and knowledge to troubleshoot and resolve common production issues in a distributed system. These problems range from simple configuration issues to complex system architecture failures.

Without the ability to troubleshoot and resolve these issues, businesses can suffer significant consequences such as lost revenue, damage to their reputation, and decreased customer satisfaction.

Let's look at how you can fix common production problems in a distributed system.

Post-mortem analysis: Identifying the cause of the problem

In modern (microservices) deployments, software teams typically examine two different things during a post-mortem:

  • Traces of the network flow between the (microservice) components, with tools like Kiali, Jaeger and Istio
  • Infrastructure components such as runtimes, artifacts and more


Let’s dive into the process.

Information collection on the problem

The first step is to gather as much information as possible about the problem to determine the root cause of a production problem in a distributed system. This may involve analyzing logs, monitoring data and generated error messages. Log management and monitoring tools can help collect and organize this data so that it is easier to analyze.
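As a toy illustration of the kind of analysis a log management tool automates, the sketch below counts ERROR entries per service. The log format ("TIMESTAMP LEVEL service message") and service names are invented for the example:

```python
import re
from collections import Counter

def summarize_errors(log_lines):
    """Count ERROR messages per service, assuming a hypothetical log format
    like '2024-01-01T12:00:00 ERROR payment-service timeout calling db'."""
    pattern = re.compile(r"\bERROR\s+(\S+)")
    counts = Counter()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

logs = [
    "2024-01-01T12:00:00 INFO  checkout-service request ok",
    "2024-01-01T12:00:01 ERROR payment-service timeout calling db",
    "2024-01-01T12:00:02 ERROR payment-service timeout calling db",
    "2024-01-01T12:00:03 ERROR auth-service token expired",
]
print(summarize_errors(logs).most_common(1))  # → [('payment-service', 2)]
```

Ranking services by error volume like this is often the fastest way to decide where to dig deeper first.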

To collect the information, you can use the following tools:

  • Log management tools: e.g., Elastic Stack (ELK), Graylog
  • Monitoring tools: e.g., Prometheus, Grafana

Identification of patterns and correlations

Once you have a clear understanding of the problem, the next step is to identify patterns and correlations that may point to the root cause of the problem. This may involve looking for trends or changes in the data that occurred around the time of the problem. Visualization tools and anomaly detection tools can also be helpful here, as they can help identify unusual patterns or deviations from normal behavior.

You can use the following tools to detect patterns and correlations:

  • Visualization tools: e.g., Kibana, Datadog
  • Anomaly detection tools: e.g., New Relic
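Under the hood, anomaly detection often starts with something as simple as flagging values that sit far from the mean. A minimal sketch, with an invented latency series and a z-score threshold chosen purely for the example:

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return indices whose value deviates from the mean by more than
    `threshold` sample standard deviations — a crude stand-in for what
    commercial anomaly detection tools do with far more sophistication."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Latency samples in ms; index 5 is the spike around the incident window.
latencies = [120, 118, 125, 121, 119, 900, 122, 117]
print(find_anomalies(latencies))  # → [5]
```

Real tools add seasonality handling, rolling windows and alerting on top, but the core idea — "how unusual is this point compared to normal behavior?" — is the same.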

Use of debugging tools and techniques

Once you understand the problem and the possible causes, it's time to start troubleshooting. This may involve using tools such as debuggers and profilers to understand what is happening at a deeper level within the system. It's also important to be systematic in your troubleshooting. Start with the most likely causes and work through the list until you find the root cause.

Essential tools for troubleshooting:

  • Debuggers: e.g., GDB, LLDB
  • Profilers: e.g., perf, VTune
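The tools above target native code; for a Python service, the standard library's cProfile plays the profiler role. A small sketch that profiles a deliberately slow lookup to show how a hotspot surfaces:

```python
import cProfile
import io
import pstats

def slow_lookup(items, targets):
    # O(n*m) list membership — the kind of hotspot a profiler surfaces;
    # converting `items` to a set would make this O(n + m).
    return [t for t in targets if t in items]

profiler = cProfile.Profile()
profiler.enable()
slow_lookup(list(range(5000)), list(range(0, 5000, 7)))
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report[:200])  # header shows total calls and time, then the top functions
```

Sorting by cumulative time and reading the top entries usually points straight at the function worth inspecting in a debugger.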

Set priorities and structure the troubleshooting process

One of the most important steps in troubleshooting and resolving production problems in distributed systems is prioritizing and organizing the process. This includes determining the impact of the problem on users and the system, and creating an action plan and timeline for resolution.

Determining the impact of the problem on the users and the system

To determine the impact of the problem, it is important to consider factors such as the number of users affected, the severity of the problem, and the potential consequences if the problem is not fixed. This information can be used to prioritize troubleshooting and problem resolution according to their importance.
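One way to make prioritization repeatable is a simple scoring rule over exactly these factors. The weights, thresholds and priority labels below are purely illustrative, not an industry standard:

```python
def incident_priority(users_affected, severity, revenue_at_risk):
    """Toy scoring model: `users_affected` is the fraction of the user base
    (0.0-1.0), `severity` is 1-5, `revenue_at_risk` is per hour.
    All weights are invented for illustration."""
    score = severity * 10 + users_affected * 50 + min(revenue_at_risk / 1000, 40)
    if score >= 70:
        return "P1 - all hands"
    if score >= 40:
        return "P2 - fix this business day"
    return "P3 - schedule in backlog"

# A checkout outage hitting most users vs. a cosmetic glitch:
print(incident_priority(users_affected=0.8, severity=5, revenue_at_risk=20000))
# → P1 - all hands
print(incident_priority(users_affected=0.05, severity=2, revenue_at_risk=500))
# → P3 - schedule in backlog
```

Even a crude model like this helps, because it forces the team to state its criteria before an incident rather than argue about them during one.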

Establish an action plan and a timetable for solving the problem

Once you have identified the impact of the problem, it is important to create an action plan and a timeline for resolution. This may require breaking down the troubleshooting into smaller, manageable tasks and setting a deadline for each task. It is also critical to include all necessary parties, such as developers and IT support, in the troubleshooting process and assign specific tasks to each team member.

By organizing and prioritizing the troubleshooting process, you can ensure that you resolve the issue promptly and efficiently, minimizing the impact on users and the system.

Fixing the problem

Temporary bug fixes to minimize impact on users

When a production problem occurs in a distributed system, it is important to minimize the impact on users. This may include temporary solutions, such as disabling certain features or redirecting traffic to another server until a permanent solution can be implemented.

It is important to carefully consider the possible consequences of a temporary solution and ensure that it does not cause further problems or complications. Also, try to inform users and affected parties about quick solutions so that they are aware of the situation and the potential impact.
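A common shape for such a temporary fix is a feature-flag kill switch. This is a minimal in-process sketch; in production the flags would live in a shared store (e.g. a config service) so all nodes see the change, and the feature names here are invented:

```python
class FeatureFlags:
    """Minimal in-process kill switch: features are on by default and can
    be disabled at runtime without a redeploy."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature):
        self._disabled.add(feature)

    def enable(self, feature):
        self._disabled.discard(feature)

    def is_enabled(self, feature):
        return feature not in self._disabled

def render_homepage(flags):
    sections = ["header", "catalog"]
    if flags.is_enabled("recommendations"):
        sections.append("recommendations")
    return sections

flags = FeatureFlags()
flags.disable("recommendations")  # temporary mitigation during the incident
print(render_homepage(flags))  # → ['header', 'catalog']
```

Because the flag flip is reversible and independent of a deploy, it is usually much safer than rolling code forward or back under pressure.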

Implementation of permanent solutions

Once the impact on users has been minimized through temporary solutions, the next step is to implement permanent solutions to address the root cause of the problem. This may involve changing the system architecture, updating software or hardware, or implementing new processes or procedures.

Any permanent solution must be carefully planned and tested to ensure that it is effective and practical. It may also be necessary to involve outside experts or vendors if the problem requires specialized knowledge or resources.

Testing and verification of the solution

Once a permanent solution has been implemented, it is important to thoroughly test and verify that the solution is effective and has no unintended consequences. For example, you can run stress tests, run simulations, or monitor the system to ensure that the problem does not reoccur.
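A stress test can be as simple as firing many concurrent requests at the patched code path and checking that the error rate stays within a budget. The handler below is a stand-in that fails about 1% of the time, and the request counts and error budget are illustrative:

```python
import concurrent.futures
import random

random.seed(7)  # deterministic failure pattern for this sketch

def flaky_handler(request_id):
    """Stand-in for the patched endpoint; ~1% of calls fail at random."""
    if random.random() < 0.01:
        raise RuntimeError("transient failure")
    return "ok"

def stress_test(handler, requests=1000, workers=16, max_error_rate=0.05):
    """Fire `requests` calls across a thread pool and compare the observed
    error rate against the allowed budget."""
    errors = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(handler, i) for i in range(requests)]
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
            except RuntimeError:
                errors += 1
    rate = errors / requests
    return rate, rate <= max_error_rate

rate, passed = stress_test(flaky_handler)
print(f"error rate {rate:.2%}, passed: {passed}")
```

Against a real system you would replace the stub with HTTP calls (and add latency percentiles), but the pass/fail structure stays the same.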

Testing and verifying the problem resolution is an important step in troubleshooting, as it helps to fix the problem and ensure that the system is working correctly. In addition, it is important to document the testing and verification process for future reference.

Most importantly, modern software is developed to heal itself. To achieve this, software teams ensure that the software is properly tested during the development phase, e.g. through unit tests and automated integration tests.

These tests cover edge and corner cases.

The software project should include a QA team and have the following environments:

  • Development/Quality Assurance (dev/QA)
  • User acceptance testing (UAT, copy of prod)
  • Production environment

Once the code and tests pass in the development/quality assurance environment, the release should be promoted to the UAT environment, which holds a regularly updated copy of the production data.

Good, structured code is essential in any software project.

Review after an incident

Analyze the cause of the problem

After fixing a problem in a distributed system, it is essential to perform a follow-up investigation to determine the root cause and prevent similar problems from occurring.

This may include analyzing logs, monitoring data, and other relevant information to understand what caused the problem and how the root cause was resolved. It may also include gathering feedback from users and other stakeholders and performing root cause analysis such as the 5 Whys method.
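The 5 Whys method can be captured in a few lines: each answer becomes the subject of the next "why", and the last answer is the candidate root cause. The incident details below are invented for illustration:

```python
def five_whys(problem, answers):
    """Record a 5 Whys chain: starting from the problem statement, each
    answer explains the previous link, and the final answer is treated
    as the candidate root cause."""
    chain = [problem] + list(answers)
    return {"chain": chain, "root_cause": chain[-1]}

analysis = five_whys(
    "Checkout requests failed for 30 minutes",
    [
        "The payment service returned 500s",
        "Its database connection pool was exhausted",
        "A deploy doubled connection usage per request",
        "The change was not load-tested",
        "No load test is required before production deploys",
    ],
)
print(analysis["root_cause"])  # → No load test is required before production deploys
```

Note how the final "why" lands on a process gap rather than a person — that is the point of the exercise, and it feeds directly into the preventive measures discussed next.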

The goal of the post-incident review is to identify any underlying issues or vulnerabilities in the system that may have contributed to the problem and take preventative measures to avoid similar situations in the future.

Implementing preventive measures to avoid similar problems in the future

Once the cause of the problem has been identified, the next step is to take preventive measures to avoid similar situations. This may involve changing the system architecture, updating software or hardware, or introducing new processes or procedures.

It is important to carefully plan and test all preventive measures to ensure they are effective and do not have unintended consequences. It may also be necessary to involve outside experts or vendors if the problem requires specialized knowledge or resources.

Document the process for future reference

In addition to implementing preventive measures, it is important to document the entire troubleshooting process for future reference. This can help identify any patterns or common issues that occur in the system and can serve as a valuable resource for future troubleshooting efforts.

Documenting the process can also improve communication and collaboration between teams and serve as a learning opportunity for continuous improvement.


Troubleshooting and resolving production problems in distributed systems is critical to maintaining system functionality and reliability. This includes identifying the source of the fault, prioritizing and organizing the troubleshooting process, fixing the problem, and conducting a post-incident review to determine root causes and prevent similar problems.

Effective troubleshooting requires careful planning, attention to detail, and a proactive approach to continuous learning and improvement. By taking these steps, organizations can ensure that production issues are resolved promptly and efficiently, minimizing the impact on users and the system.

It is important to prioritize the resolution of production issues in distributed systems, as these issues can have significant consequences if left unaddressed. By taking a proactive approach to troubleshooting, organizations can maintain the reliability and functionality of their systems and provide a seamless experience for their users.

Incident post mortem: how to manage downtimes

To err is human, and mistakes can cause simple or even severe incidents. Let’s face it: we can try to prevent mistakes, but sooner or later one will happen. Yet making mistakes isn’t the biggest issue here. We need to make sure that we learn from them. Introducing a post mortem after an incident or error has occurred in your projects enables you to learn from past mistakes. In this post, we will cover how to introduce a post mortem culture into your DevOps organization and why blaming someone doesn’t solve anything.

The best way to learn from incidents is to institute post mortems.

What is a post mortem in projects and incidents?

Incident post-mortems, or retrospectives, are evaluations of how an incident occurred, what impact it had on business metrics and goals, and how the team fixed the underlying errors.

Why do you need post-mortem analysis in DevOps?

Many companies, big or small, experience major incidents at least several times a year. As mentioned, you can work to prevent incidents, reduce their impact and shorten their effect on your goals or other important KPIs. But they will still occur, no matter what.

Changes in your systems, code or infrastructure can introduce instabilities that cause incidents. As a DevOps champion, you probably release new iterations of code or updates at a high frequency. Frequent, small releases reduce the risk of any individual release failing. They will most likely not reduce the total number of incidents, but they drastically reduce the probability of the entire system going down.

But what happens when a critical incident occurs?

Rather than playing the blame game and pointing the finger at the person responsible, the only thing that counts is finding what caused the error that led to the incident and, if necessary, mitigating the impact.

Analyzing the root cause and implementing preventive measures is important to make sure such incidents do not occur too often. Otherwise, incidents may start multiplying and fixing errors becomes part of weekly routines. Sooner or later, the only thing your teams will work on is incident response. And nobody wants that! To break out of this spiral, your team needs to acknowledge the new status quo.

And what does a postmortem analysis look like?

In order to learn from past mistakes, we must conduct a detailed analysis of events following incidents.

A postmortem should be initiated whenever an incident requires an engineer to respond. Normally, a postmortem analysis records the following objective evidence:

  • What triggered the incident?
  • What was the impact?
  • How long did it take to detect and mitigate the incident?
  • What steps were taken to mitigate the incident?
  • Has the team conducted a root cause analysis?
  • Can we develop a timeline of significant activity? Centralize key events from chat conversations, incident details, and more.
  • What are the learnings and next steps? What went well? What didn’t? How do we prevent this issue from happening again?
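The evidence listed above maps naturally onto a structured record. Below is a minimal sketch with invented incident data, deriving time-to-detect and time-to-mitigate from the timestamps:

```python
from datetime import datetime

def build_postmortem(title, started, detected, mitigated, trigger, impact, actions):
    """Assemble a minimal blameless postmortem record. Timestamps are ISO
    strings; detection and mitigation durations are derived from them."""
    t0 = datetime.fromisoformat(started)
    td = datetime.fromisoformat(detected)
    tm = datetime.fromisoformat(mitigated)
    return {
        "title": title,
        "trigger": trigger,
        "impact": impact,
        "time_to_detect_min": (td - t0).total_seconds() / 60,
        "time_to_mitigate_min": (tm - t0).total_seconds() / 60,
        "action_items": actions,
    }

pm = build_postmortem(
    title="Payment API outage",
    started="2024-03-01T10:00:00",
    detected="2024-03-01T10:12:00",
    mitigated="2024-03-01T11:00:00",
    trigger="Router configuration change",
    impact="Checkout unavailable for all users",
    actions=["Add config validation to the deploy pipeline"],
)
print(pm["time_to_detect_min"], pm["time_to_mitigate_min"])  # → 12.0 60.0
```

Keeping every postmortem in the same machine-readable shape makes it easy to track detection and mitigation times across incidents over the year.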

In most cases, analysis is conducted by team members who responded to the incident and mitigated or investigated the root cause.

Example: Facebook, WhatsApp and Instagram outage, October 2021

On October 4, 2021, multiple public Facebook services, including WhatsApp and Instagram, experienced a worldwide outage lasting roughly six hours. As Facebook stated, configuration changes on the backbone routers that coordinate network traffic between its data centers caused issues that interrupted communication.


What Facebook did seems very simple at first glance: they followed their internal "Storm Drills" process. This is how Facebook makes sure they always know exactly what to do next when something happens.

Hereafter, they made sure to perform an incident postmortem to find out what really happened.

Introducing blameless post mortems

The process of conducting post mortems seems quite straightforward. Yet, for many organizations that have never built post mortems into their incident response, it is a challenge you shouldn’t take too lightly.

The introduction and ongoing success of any new or changed process require time and effort from all levels of the organization.

To make the transition easier, there are a few key principles to follow:

  • Stay away from blame games and finger-pointing: This is the most crucial aspect of getting things right out of the gate. If the analysis focuses on blaming the people who caused the incident instead of making sure the team learns and improves, the initiative will cause harm instead of good.
  • Communicate openly and mistake-friendly: Make sure post mortem meetings aren’t held to find someone to blame. They are an opportunity for the teams to learn and improve. This means being honest about what happened and correcting expectations.
  • Introduce a post mortem leader: A dedicated lead ensures that each and every incident response finishes with a postmortem. These leaders usually have a broad understanding of all services and of DevOps. The leader sets the tone for post mortems and largely determines the collective attitude of blamelessness.
  • Work together, share information and foster communication: Make sure postmortems are well documented in an internal platform (such as a Confluence wiki). Every post mortem can then serve as viable training material for your teams.
  • Bring top management on board and include all relevant stakeholders: To win over every team member in the organization, top-down communication is essential. At the same time, managers need to know what happened, so make sure to communicate clear numbers and insights.
  • Make decisions: Ideally, a good blameless postmortem will provide preventive suggestions. Identify who is responsible for approving recommendations and reviewing the written reports.