Metrics Monitoring
February 08, 2024
Overview
Metrics Monitoring is not a standard industry term, but it points to the broader domain of Observability, and more specifically to Metrics, Monitoring, Operations, and Alerts. Logging could be added to that list, but it deserves its own blog post (although it is touched on here). The same can be said for Tracing.
Metrics
When an operation or action occurs in a system, a decision has to be made: should it be logged, or should a metric be emitted for it? Logs should be the more common choice, but they need to follow a standardized, easily searchable pattern that adheres to established log levels. The deciding factor is whether the operation merits an alert if something goes wrong: if it does, a metric is the better fit; if not, a log is usually enough. Even so, it is a good idea to have analytics that aggregate the total number of logs per period of time by log level and by any other filters that matter to the system's operational health. These aggregates become lagging metrics, which can raise alerts when thresholds are breached.
Note: The term lagging metric originates from technical analysis in investing (lagging indicator). In this context, it refers to a metric that is not emitted immediately, but is instead created from some post-processing or analytics.
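As a rough illustration of a lagging metric, the sketch below counts log lines per minute and per log level, then reports the minutes in which a chosen level exceeded a threshold. The log format, the sample lines, and the function names are all assumptions made for this example; a real system would likely do this in its log analytics tooling rather than in application code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log format: "<ISO-8601 timestamp> <LEVEL> <message>"
SAMPLE_LOGS = [
    "2024-02-08T10:00:05 ERROR payment failed",
    "2024-02-08T10:00:40 INFO request served",
    "2024-02-08T10:01:10 ERROR payment failed",
    "2024-02-08T10:01:30 ERROR db timeout",
    "2024-02-08T10:01:55 WARNING slow response",
]


def count_by_minute_and_level(lines):
    """Return a Counter keyed by (minute, level) for the given log lines."""
    counts = Counter()
    for line in lines:
        timestamp_text, level, _message = line.split(" ", 2)
        # Truncate the timestamp to the start of its minute to form the window.
        minute = datetime.fromisoformat(timestamp_text).replace(second=0, microsecond=0)
        counts[(minute, level)] += 1
    return counts


def breached_windows(counts, level, threshold):
    """Return the sorted minutes in which `level` logs exceeded `threshold`."""
    return sorted(
        minute
        for (minute, lvl), n in counts.items()
        if lvl == level and n > threshold
    )


counts = count_by_minute_and_level(SAMPLE_LOGS)
alerts = breached_windows(counts, "ERROR", threshold=1)
for minute in alerts:
    print(f"ERROR count breached threshold in window starting {minute.isoformat()}")
```

The metric here only exists after the logs have been collected and post-processed, which is what makes it lagging rather than emitted at the moment the operation happens.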
Monitoring
Monitoring is the operational process of observing Dashboards that contain metrics and alert statuses. These Dashboards are important interfaces for communicating the state and health of a system. Their arrangement and visual clarity are therefore critical to maintaining operational excellence and to preventing operational gaps, where a problem goes unnoticed even though the observability tools actually cover it.
Alerts
Alerts are active notifications that are triggered after certain conditions are met for a metric. These conditions can take several forms, for example:
- A single data point for a metric crosses a threshold (could be greater than, less than, or any other inequality). Use cases for this are numerous (e.g. any failure to complete a task).
- Multiple data points over a period of time cross a certain threshold. One possible use case for this is a system workflow that is expected to experience delays in consumption. Only extended delays should raise an alert.
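The two kinds of condition above can be sketched as follows. This is a minimal illustration, not how any particular alerting product evaluates rules; the function names, the comparator table, and the sample values are assumptions for the example.

```python
import operator

# Supported inequalities for comparing a data point against a threshold.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def single_point_alert(value, comparator, threshold):
    """Fire as soon as one data point crosses the threshold."""
    return COMPARATORS[comparator](value, threshold)


def sustained_alert(values, comparator, threshold, required_points):
    """Fire only if the most recent `required_points` values all cross the threshold."""
    if required_points <= 0:
        raise ValueError("required_points must be positive")
    if len(values) < required_points:
        return False  # Not enough data yet to call the breach sustained.
    recent = values[-required_points:]
    return all(COMPARATORS[comparator](v, threshold) for v in recent)


# Any failed task should alert immediately (hypothetical failure count of 1).
task_failed = single_point_alert(1, ">", 0)

# Hypothetical consumer delay in seconds, one data point per minute.
brief_delay = [5, 40, 45, 6, 4]
extended_delay = [5, 40, 45, 50, 60]

brief_fires = sustained_alert(brief_delay, ">", 30, required_points=3)
extended_fires = sustained_alert(extended_delay, ">", 30, required_points=3)
print(task_failed, brief_fires, extended_fires)
```

With these sample values, the brief delay recovers before three consecutive points breach the threshold, so only the extended delay raises an alert.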
Finding a threshold that balances signal against noise can be difficult and often takes more than one iteration. The difficulty is compounded in larger, more mature systems, where the number of alerts tends to grow and many low-priority alerts end up being ignored for long periods of time (possibly forever). This can be alleviated if teammates come together to categorize alert-created tickets and work out which changes would reduce how often those tickets are raised.
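One possible way to start such a categorization is to measure, per alert, how often its tickets were closed without any action. The sketch below assumes a hypothetical ticket record with `alert` and `resolution` fields and an arbitrary cutoff of 50%; none of these come from a real ticketing system.

```python
from collections import Counter

# Hypothetical ticket records created by alerts.
TICKETS = [
    {"alert": "disk-usage-high", "resolution": "no_action"},
    {"alert": "disk-usage-high", "resolution": "no_action"},
    {"alert": "disk-usage-high", "resolution": "fixed"},
    {"alert": "queue-lag", "resolution": "fixed"},
    {"alert": "cert-expiry", "resolution": "no_action"},
    {"alert": "cert-expiry", "resolution": "no_action"},
]


def noise_ratio_by_alert(tickets):
    """Return, per alert, the fraction of its tickets closed with no action."""
    totals = Counter(t["alert"] for t in tickets)
    no_action = Counter(
        t["alert"] for t in tickets if t["resolution"] == "no_action"
    )
    return {alert: no_action[alert] / total for alert, total in totals.items()}


def tuning_candidates(tickets, min_ratio=0.5):
    """Return alerts whose noise ratio is at least `min_ratio`, noisiest first."""
    ratios = noise_ratio_by_alert(tickets)
    return sorted(
        (alert for alert, ratio in ratios.items() if ratio >= min_ratio),
        key=lambda alert: (-ratios[alert], alert),
    )


for alert in tuning_candidates(TICKETS):
    print(f"Review threshold or remove alert: {alert}")
```

Alerts that surface here are candidates for a threshold change, a switch to a sustained condition, or removal, which is a decision the team still has to make together.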
Conclusion
Optimal Monitoring involves more than just Metrics, or even Alerts. It requires coverage (especially over critical code paths) to optimize system visibility to Operators. Furthermore, that data needs to be arranged such that a quick overview of system health is readily visible, followed by the ability to dive deeper in specific areas of the system if anything seems amiss in the Dashboard's overview.