Reacting to error logs

Logs are events that took place in a system and are considered of interest for their content. The most basic log event includes three pieces of information:

  • What happened, or message
  • When did it happen, or timestamp
  • How important is it, or level
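
As a minimal sketch of those three fields, here is how they could look with Python's standard logging module. The logger name and the message are hypothetical, and the output is written to an in-memory stream only to keep the example self-contained.

```python
import io
import logging

# In-memory stream so the example does not depend on the console.
stream = io.StringIO()
handler = logging.StreamHandler(stream)

# Timestamp (when), level (how important), message (what happened).
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

# "shop.example" is a hypothetical logger name used for illustration.
logger = logging.getLogger("shop.example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error("Payment could not be processed")

line = stream.getvalue().strip()
print(line)
```

The printed line starts with the timestamp, followed by the level and the message.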

The level of a log indicates the severity of its content, usually ranging from informational messages to errors that sooner or later may require your attention, as they may indicate the existence of a bug. I advocate for a strict zero-error policy in my teams: when an error is logged, there should be an action to understand what happened and fix it. When errors show up in your logs regularly, they stop feeling exceptional, and they lose their perceived value as a signal.

Investigating

Once I realize an error was logged, my commitment is to read the log as soon as possible. Usually, this takes a few minutes at most, enough to go through all the fields of the log and make up my mind about the next steps to take.

Why did it happen? Where is the offending piece of code? Which system, use case and file does it belong to? Anything suspicious about the input data? Was there any unusual activity in the infrastructure around the time of the error? What are the implications (affected users, invalid data, manual action required)? Could it happen again? If any of these questions remains unanswered, I will estimate how much time I need to figure it out and assess the urgency of the matter. Depending on the time and urgency, I may postpone the investigation or act immediately.

If the root of the issue is a bug, it could have impacted the business. Make sure to communicate early with your stakeholders and keep them updated with any relevant findings. If similar errors are becoming more frequent in the logs and the impact on the business is confirmed, consider declaring an incident to receive all the attention and support necessary to solve the issue as soon as possible.

The goal of the investigation is to understand the cause of the error and have a high-level idea of how to fix the issue. Sometimes it’s useful to take a look at the surrounding logs as well, in order to understand the state of the application when the error happened. If many operations are being executed concurrently, having correlation IDs can help to trace logs from the same business transaction. My suggestion is to timebox this investigation, and create a follow-up spike if there are still unknowns left by the end.
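
To illustrate the correlation ID idea, here is a rough sketch using Python's standard logging and contextvars modules. The names correlation_id, CorrelationIdFilter, handle_purchase and logs_for are hypothetical, as are the request IDs; a real service would typically take the ID from an incoming request header or message.

```python
import contextvars
import io
import logging

# Holds the ID of the business transaction currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
)

logger = logging.getLogger("orders.example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_purchase(request_id, user):
    # Every log emitted inside this call carries the same ID.
    token = correlation_id.set(request_id)
    try:
        logger.info("purchase started for %s", user)
        logger.error("purchase failed for %s", user)
    finally:
        correlation_id.reset(token)


def logs_for(request_id):
    # Keep only the lines that belong to one business transaction.
    marker = f"[{request_id}]"
    return [line for line in stream.getvalue().splitlines() if marker in line]


handle_purchase("req-1", "alice")
handle_purchase("req-2", "bob")
print(logs_for("req-1"))
```

Filtering by the ID returns only the lines of that transaction, even when logs from other requests are interleaved with them.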

Fixing

Depending on the estimated effort to fix the issue, I will create a task in the backlog or start working on it right away. Sometimes fixing it means starting a conversation with the owners of another service that is causing the mishap. Other times it means decreasing the log level, because the situation was actually expected and was incorrectly logged as an error; for example, a user trying to buy something when they don't have credits left.
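
The credits example could be handled along these lines. This is only a sketch: InsufficientCredits, buy and handle_buy are hypothetical names, and the choice of the info level for the expected case is an assumption that depends on your team's conventions.

```python
import logging

# Hypothetical logger name used for illustration.
logger = logging.getLogger("shop.credits.example")


class InsufficientCredits(Exception):
    """Expected business outcome: the user has no credits left."""


def buy(credits, price):
    if credits < price:
        raise InsufficientCredits(f"needs {price}, has {credits}")
    return credits - price


def handle_buy(credits, price):
    try:
        return buy(credits, price)
    except InsufficientCredits as exc:
        # Expected situation, so it is logged below the error level.
        logger.info("purchase rejected: %s", exc)
        return None
    except Exception:
        # Anything else is unexpected and still deserves an error log.
        logger.exception("unexpected failure while processing purchase")
        raise
```

With this split, the error level stays reserved for failures that nobody anticipated.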

If the root of the error lies in a service owned by my team, the first step is to write a test that reproduces it. Once the fix is in place, this test will keep the incorrect behaviour from coming back unnoticed, as the test suite is automatically executed as part of the deployment pipeline.
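
As an illustration, a regression test might look like the sketch below. The bug is invented for the example: a hypothetical average_item_price function that used to raise ZeroDivisionError for an empty cart. The function is shown in its fixed form, and the first test replays the input that would have been found in the error log.

```python
import unittest


def average_item_price(prices):
    """Fixed version: an empty cart used to raise ZeroDivisionError."""
    if not prices:
        return 0.0
    return sum(prices) / len(prices)


class AverageItemPriceRegressionTest(unittest.TestCase):
    def test_empty_cart_does_not_raise(self):
        # Input taken from the (hypothetical) error log.
        self.assertEqual(average_item_price([]), 0.0)

    def test_regular_cart(self):
        self.assertEqual(average_item_price([2.0, 4.0]), 3.0)
```

Saved in a test module, it can be run with python -m unittest, the same way a deployment pipeline would run it.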

Finally, once the fix has been tested and deployed to production, don’t forget to share the acquired knowledge with your teammates.