In this article I will explain why monitoring microservices is so important and what the key things to consider are.
microservices, monitoring, logging, traces
When you are working on a monolith, identifying what's going on is relatively easy. There is only one service in the network -the monolith itself- where everything happens, so you can just debug locally and, more often than not, find the origin of an issue. But what happens with microservices, where you have many services in the network, often splitting work across different API calls or message handlers? Let's take a look.
First -and this also applies to monoliths- you need to learn how to log what happens in the code. In my experience, most companies almost completely forget about logging and only write a few lines of logs for the most critical errors of the application. People forget about log levels, thinking only of the error level.
If you want to have really good logging in place, you need to add logs for almost everything happening in the system at the debug level. When you are trying to debug something in production, the only feasible way to do it -apart from running locally and trying to simulate the same conditions as in production- is to enable the debug mode. So, to begin, add those lines at the debug level so you can use them in development, where debug mode is enabled by default, and also in production, by turning the mode on.
It is also important to have a very easy way to enable the debug mode in production; if you need to change the code, deploy again, fix the issue, and commit again... chances are you will end up trying to fix it by adding "echo" lines in the middle of the code, and the fix will take far too long.
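One simple way to make the switch this easy is to read the log level from the environment, so an operator can flip to debug without a new deployment. A sketch, assuming an environment variable named `LOG_LEVEL` (the name is a convention, not a standard):

```python
import logging
import os

# Read the level from the environment so operators can switch to debug
# without committing code or redeploying (LOG_LEVEL is an assumed name).
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))

logger = logging.getLogger("payments")
logger.debug("visible only when LOG_LEVEL=DEBUG")
logger.info("always visible at the default level")
```

In a container setup, flipping the mode is then just restarting the service with `LOG_LEVEL=DEBUG` set.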
For critical errors, don't forget to provide enough information in the log. "Something went wrong" is not a good message at all if the intention is to fix the issue. Add the stack trace and any other useful information you can think of.
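In Python, for example, `logger.exception` records the full stack trace along with your message, which is exactly the difference between "something went wrong" and an actionable log line (the failure here is simulated):

```python
import logging

logging.basicConfig(level=logging.ERROR)
logger = logging.getLogger("billing")

def charge(customer_id: str) -> bool:
    try:
        # Simulated failure standing in for a real gateway call.
        raise ConnectionError("payment gateway unreachable")
    except ConnectionError:
        # logger.exception logs the message AND the stack trace, plus the
        # context (customer_id) needed to actually act on the error.
        logger.exception("charge failed for customer_id=%s", customer_id)
        return False

charge("C-42")
```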
There's also a different tool that is very useful for monitoring microservices: tracing. Google Cloud has a pretty good one, but in general it consists of recording a trace at each important part of the code, so you know the time between spans -between important parts-, which is very useful for detecting performance issues.
You need to think about the important parts of your code -external calls, repositories, long processes...- those are the parts you want information about. Also, if you put a trace at the very beginning and another at the end of the process, you get the total time of each API call.
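The idea can be sketched with a toy span helper in pure Python (a real system would use something like OpenTelemetry or Google Cloud Trace; the span names here are illustrative):

```python
import time
from contextlib import contextmanager

# Collected (span name, duration in seconds) pairs; a real tracer would
# export these to a backend instead of keeping them in memory.
spans: list[tuple[str, float]] = []

@contextmanager
def span(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, time.perf_counter() - start))

def handle_request() -> None:
    with span("handle_request"):            # outermost span: total call time
        with span("db.load_user"):          # important part: repository access
            time.sleep(0.01)
        with span("external.payment_api"):  # important part: external call
            time.sleep(0.01)

handle_request()
for name, seconds in spans:
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Because the outermost span wraps the whole handler, its duration is the total time of the API call, and the inner spans show where that time went.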
Distributed logging and traces: the correlation id
As I said in the introduction, monitoring microservices is not the same as monitoring monoliths, because of the distributed nature of the former. How can we relate logs or traces from one call to another? How can we relate logs or traces from the main thread and a daemon listening for a message? Correlation ids come to the rescue.
A correlation id is a random id generated at the very beginning of the process. When the main process needs to make an external call (or fire an event), it passes the correlation id in a header; the secondary process, instead of generating a new correlation id, uses the one from the header. This correlation id is, of course, written into every log line and trace, so when you want to follow the breadcrumbs, you only need to filter by correlation id.
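A minimal sketch of the generate-or-adopt logic, assuming a conventional `X-Correlation-Id` header (a common convention, not a standard):

```python
import uuid
from contextvars import ContextVar

# Holds the current request's correlation id for this execution context.
correlation_id: ContextVar[str] = ContextVar("correlation_id")

def start_request(headers: dict) -> str:
    # Reuse the incoming id if a caller already generated one;
    # otherwise this is the start of the chain, so create a fresh one.
    cid = headers.get("X-Correlation-Id", str(uuid.uuid4()))
    correlation_id.set(cid)
    return cid

def outgoing_headers() -> dict:
    # Every external call (or published event) carries the same id forward.
    return {"X-Correlation-Id": correlation_id.get()}

# Service A starts a fresh chain...
cid = start_request({})
# ...and service B, receiving A's call, adopts the same id.
assert start_request(outgoing_headers()) == cid
```

Real frameworks usually do the header extraction and injection in middleware, so application code never touches the id directly.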
If you don't use correlation ids, you are going to have a very bad time trying to follow things in a microservice architecture.
Having very good logs or traces is completely useless if nobody looks at them. And this is exactly what tends to happen when everything is fine and nobody is pushing us because production is down. But there are often interesting insights in the logs and traces that we should be aware of. For example:
- When an endpoint is consistently taking too long: we should be aware of this because we probably have an issue in the code or in the infrastructure.
- When 400-like errors are too frequent: is the user really failing that much, or do we have an issue?
- When we have 500-like errors: if we are not watching the logs and traces, the only way to identify the problem is a user alerting us. That happens sometimes, but not always.
So, the only way to stop being reactive to user alerts and become proactive about errors is to have alerts in place. Many cloud providers, like Google and AWS, have tools to attach alerts to log or trace events. We only need to configure rules like "when we have more than 20 400-like errors in the same day, send an email".
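The rule above boils down to a simple threshold check. As a toy sketch (in practice you would configure this in Cloud Monitoring or CloudWatch rather than code it yourself):

```python
from collections import Counter

# Threshold from the example rule: "more than 20 400-like errors in a day".
THRESHOLD = 20

def should_alert(status_codes_today: list) -> bool:
    # Bucket status codes by their class: 200 -> 2, 404 -> 4, 500 -> 5.
    counts = Counter(code // 100 for code in status_codes_today)
    return counts[4] > THRESHOLD

print(should_alert([404] * 25))  # prints True: 25 client errors > 20
```

The real value of provider-managed alerting is that the counting, time window, and notification channel (email, Slack, paging) are configuration, not code.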
Having the proper amount and kind of alerts is absolutely essential if you have a microservices architecture.
Many companies fail when implementing microservices because they are not conscious enough of the importance of good monitoring. Problems arise and they feel lost. Consider the time you will need to put this in place -along with other infrastructure requirements- before starting your journey.