When working with code locally, you’re in a very comfortable position: you are in charge of the whole environment, and you probably have admin rights to everything. You can inspect and change any config as you wish, attach a debugger, and check the inner state of the application you’re developing. However, that’s not always possible. Once you build your application and deploy it to production, it’s much harder or even impossible to attach a debugger to it. Yet failures happen not only in a local environment, but also in staging and production. What is even worse, some errors happen only in production, or only under load from users, so reproducing them locally can be really hard.

Today I’ll quickly walk you through the available options to mitigate the consequences of not having the same level of access to the deployed application as you have to the one run locally. These are: tests, metrics and logs. Let’s start!

Tests

The first foundation of debugging in production. A comprehensive set of unit, integration and end-to-end tests already gives you some level of confidence that your application is working correctly.

A good practice around tests is to split them into categories:

  • Quick unit tests - run basically always, at every change
  • Slow integration tests - might include some external dependencies and can take minutes to run, so we cannot afford to wait for them multiple times a day
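One possible way to keep the two categories apart is sketched below with Python’s standard `unittest` module. The `RUN_SLOW_TESTS` environment variable, the `parse_price` function and the test names are all made up for illustration; your test framework may offer its own tagging mechanism instead.

```python
import os
import unittest

# Assumption: slow tests are opted into through an environment variable.
RUN_SLOW_TESTS = os.environ.get("RUN_SLOW_TESTS") == "1"


def parse_price(text: str) -> float:
    """Hypothetical function under test: turns ' $4.50 ' into 4.5."""
    return float(text.strip().lstrip("$"))


class QuickUnitTests(unittest.TestCase):
    """No external dependencies, cheap enough to run at every change."""

    def test_parse_price(self):
        self.assertEqual(parse_price(" $4.50 "), 4.5)


@unittest.skipUnless(RUN_SLOW_TESTS, "set RUN_SLOW_TESTS=1 to run slow tests")
class SlowIntegrationTests(unittest.TestCase):
    """Stand-in for tests that talk to a database or a remote service."""

    def test_round_trip_through_external_service(self):
        # Placeholder body: a real test would call the dependency here.
        self.assertEqual(parse_price("$10"), 10.0)


def run(case: type) -> unittest.TestResult:
    """Run a single test class and return its result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With this layout, a plain `python -m unittest` run skips the slow class, while setting `RUN_SLOW_TESTS=1` (for example in the CI/CD pipeline) includes it.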

Depending on how slow the slow tests are, you could run them either once every couple of commits on a local machine (if it takes like 3-5 minutes, just go & make a coffee), or, if it takes longer or you feel like they’re holding you back, just move them into your CI/CD pipeline and make a flow:

(Build) -> (Quick tests) -> (Slow tests) -> (Deploy to Stage) -> (Deploy to Prod)

There is no point in running a step if any of the preceding steps failed.
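The fail-fast behaviour of such a flow can be sketched in a few lines of Python. This is only an illustration: real CI/CD tools give you this out of the box, and the `run_pipeline` helper and the lambda steps below are hypothetical stand-ins for real build and deploy commands.

```python
from typing import Callable, List, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that were actually executed.
    """
    executed = []
    for name, action in steps:
        executed.append(name)
        if not action():
            print(f"{name} failed, skipping the remaining steps")
            break
    return executed


if __name__ == "__main__":
    pipeline = [
        ("Build", lambda: True),
        ("Quick tests", lambda: True),
        ("Slow tests", lambda: False),  # simulated failure
        ("Deploy to Stage", lambda: True),
        ("Deploy to Prod", lambda: True),
    ]
    # Neither deploy step runs because "Slow tests" failed.
    print(run_pipeline(pipeline))
```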


Metrics

Metrics serve two main purposes:

Monitoring the health of your application

  • Healthcheck HTTP code
  • Healthcheck response time
  • Message processed count

These metrics give you a sense of security. You can set up alerts on them, and usually these are the most crucial alerts - if your application cannot respond to a healthcheck within 2s over a 10-minute period, you’ve got a big problem (usually).
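To make the alert rule concrete, here is a rough sketch of how it could be evaluated over collected healthcheck samples. The `HealthSample` type and the `should_alert` function are invented for this example, and treating "no samples at all" as a reason to alert is my own assumption; in practice you would usually express this rule in your monitoring system rather than in application code.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_RESPONSE_SECONDS = 2.0   # the "within 2s" threshold
WINDOW_SECONDS = 10 * 60     # the "10-minute period"


@dataclass
class HealthSample:
    timestamp: float            # seconds since some fixed epoch
    status_code: Optional[int]  # None when the probe got no response
    response_seconds: float


def is_healthy(sample: HealthSample) -> bool:
    """A probe counts as healthy if it returned HTTP 200 fast enough."""
    return (
        sample.status_code == 200
        and sample.response_seconds <= MAX_RESPONSE_SECONDS
    )


def should_alert(samples: List[HealthSample], now: float) -> bool:
    """Alert when no probe in the last window was healthy."""
    recent = [s for s in samples if now - WINDOW_SECONDS <= s.timestamp <= now]
    if not recent:
        # Assumption: missing data is as alarming as failing probes.
        return True
    return not any(is_healthy(s) for s in recent)
```

A single healthy probe inside the window is enough to keep the alert quiet, which avoids paging someone over one slow response.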

Checking the flow of business processes served by the app

  • Successfully served requests count
  • Failed requests count
  • Exception types & count reported by the application
  • Duration of request handling

These are the metrics reporting the inner state of the application. The world is burning, but maybe there is nothing to worry about. They are helpful when you’re looking for exceptions to the usual patterns - maybe 1% of your requests are served extremely slowly? Or you started getting one specific error type for 4 hours only, starting at 1am on Saturday? You’d never know unless you had metrics for that.
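As an illustration of what collecting these numbers might look like inside the application, here is a small in-memory sketch. The `RequestMetrics` class is hypothetical and keeps everything in a plain list and counters; a real setup would typically push these values to a metrics backend instead.

```python
import math
from collections import Counter
from typing import List, Optional


class RequestMetrics:
    """In-memory counters for served/failed requests and their durations."""

    def __init__(self) -> None:
        self.served = 0
        self.failed = 0
        self.exceptions: Counter = Counter()  # exception type name -> count
        self.durations: List[float] = []      # seconds per request

    def record(self, duration: float, error: Optional[Exception] = None) -> None:
        """Record one handled request, successful or not."""
        self.durations.append(duration)
        if error is None:
            self.served += 1
        else:
            self.failed += 1
            self.exceptions[type(error).__name__] += 1

    def percentile(self, p: int) -> float:
        """Nearest-rank percentile of request durations, p in 1..100."""
        if not self.durations:
            raise ValueError("no requests recorded yet")
        ordered = sorted(self.durations)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


if __name__ == "__main__":
    metrics = RequestMetrics()
    for _ in range(98):
        metrics.record(0.1)
    for _ in range(2):
        metrics.record(5.0, error=TimeoutError("upstream too slow"))
    # The median looks fine; only the high percentile exposes the slow tail.
    print(metrics.percentile(50), metrics.percentile(99))
    print(dict(metrics.exceptions))
```

The point of the high percentile is exactly the "slow 1%" case: an average or a median would hide it completely.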

Logs

For logs to be a useful debugging tool, they need to be meaningful. Logs and metrics should complement each other. Did the metrics show a spike of HTTP 503s at some point? You should be able to fill the gaps in the story by looking at the logs. Which user or process did what? What was the state of the system? What was the exception message? What was the stack trace? These are the questions you should be able to answer by just looking at the logs.
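Here is a minimal sketch of log lines that answer those questions, using Python’s standard `logging` module. The logger name `orders`, the `user` and `request_id` fields and the `handle_request` function are made up for this example; the idea is only that every line carries who did what, and that failures include the exception message and stack trace.

```python
import io
import logging
from typing import TextIO


def build_logger(stream: TextIO) -> logging.Logger:
    """Create a logger whose every line carries user and request context."""
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s "
        "user=%(user)s request_id=%(request_id)s %(message)s"
    ))
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, user: str,
                   request_id: str, quantity: int) -> bool:
    """Hypothetical request handler that logs its outcome with context."""
    context = {"user": user, "request_id": request_id}
    try:
        if quantity <= 0:
            raise ValueError(f"quantity must be positive, got {quantity}")
        logger.info("order accepted quantity=%d", quantity, extra=context)
        return True
    except ValueError:
        # logger.exception appends the exception message and stack trace.
        logger.exception("order rejected", extra=context)
        return False


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    handle_request(log, "alice", "req-1", 2)
    handle_request(log, "bob", "req-2", 0)
    print(buffer.getvalue())
```

Two lines per request outcome is usually plenty here: one for the happy path, and one with the full traceback when something goes wrong.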

Just remember: logs need to be meaningful, but you don’t need to log everything.