
Tail Latency

Overview

When measuring the latency of anything, Tail Latency refers to the data points at the slow end of the distribution. Technically, either end of a distribution has a tail, but when the term is used in software engineering, the slow end is the one being referenced. It typically covers the 90th (P90), 95th (P95), and 99th (P99) percentiles, and many would argue it refers only to the P99. Focus often lands on these metrics because the worst cases are the most likely to cause problems.
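For a concrete feel of these percentiles, the sketch below computes P50/P90/P95/P99 over synthetic latency samples using the nearest-rank method. The log-normal sample data is an assumption for illustration, not a measurement from any real system:

    import random

    def percentile(samples, p):
        # Nearest-rank percentile: the value below which roughly p% of samples fall.
        ordered = sorted(samples)
        rank = max(1, round(p / 100 * len(ordered)))
        return ordered[rank - 1]

    # Hypothetical latency samples in milliseconds; a log-normal shape is a
    # common (assumed) approximation for request latencies.
    latencies_ms = [random.lognormvariate(3.0, 0.5) for _ in range(10_000)]

    for p in (50, 90, 95, 99):
        print(f"P{p}: {percentile(latencies_ms, p):.1f} ms")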

Table of Contents

Latency Distribution
Reducing Tail Latencies
Observability
Conclusion and Cost

Latency Distribution

The distribution of latencies is highly dependent on the thing being measured, for example:

  • Simple request / response service that sits in front of a database.
  • Asynchronous request that passes through a message broker.
  • Asynchronous request to a worker that does some processing for several minutes.

The shape of the distribution that can be expected is explored in: Latency distribution of N parallel tasks.

Some more practical examples: Examples of latency distributions on a wired network.

The distributions in the above links are useful guidelines for what to expect, but they are highly synthetic and should be taken with a grain of salt. Your system needs to be performance tested to ensure that it can meet its demand. See Load vs Stress Testing for more information.
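To see why fan-out amplifies the tail (the idea behind the parallel-tasks link above), here is a small, entirely synthetic simulation. Each request is assumed to wait for the slowest of N parallel tasks, each drawn from the same log-normal distribution; the request’s P99 climbs as N grows even though no individual task got slower:

    import random

    def fanout_p99(n_tasks, n_requests=10_000):
        # A request completes only when the slowest of its n_tasks parallel calls does.
        request_latencies = sorted(
            max(random.lognormvariate(3.0, 0.5) for _ in range(n_tasks))
            for _ in range(n_requests)
        )
        return request_latencies[int(0.99 * n_requests) - 1]

    for n in (1, 10, 100):
        print(f"{n:>3} parallel tasks -> P99 ~ {fanout_p99(n):.0f} ms")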

Reducing Tail Latencies

Tail Latencies that are longer than desired are often the result of some form of congestion that affects only a small percentage of requests. This is in contrast to a situation where all data points are slow because each individual data point simply requires too much work. That latter scenario often requires a complete re-architecture of the workload; the long Tail Latency scenario often does not. Congestion in this context usually means over-utilization of some resources, which results in under-utilization of others. Some ideas:

  • It likely suggests an infrastructure-level problem, rather than an application-level problem.
  • It could point to a situation that recurs only at some interval.
  • It might be the result of resource sharing.

A simple way to diagnose the source of congestion is to examine logs and their timestamps at each junction (e.g. each service) along a request’s journey (a sketch of this follows the lists below). This may be enough to narrow the source to a service or sub-cluster of services. However, application-level logs may not be enough to ascertain the direct cause of the problem. This is where service-level metrics become invaluable; examine the following during the windows in which these undesirably long latencies occur:

  • CPU
  • Memory
  • Connections
    • Number of active connections of the target service
    • Distribution of latency to establish a connection with the target service (a poor distribution here could indicate the need for connection pooling)

As well as metrics for any surrounding infrastructure, such as:

  • Message broker queues
  • Centralized caches

It’s critical to isolate as many components and processes as possible, at as fine a granularity as possible.
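As a rough sketch of the log-timestamp approach above, the snippet below groups structured log records by request ID and diffs consecutive timestamps, so the hop that accumulated the congestion stands out. The service names, request IDs, and timestamps are invented purely for illustration:

    from collections import defaultdict
    from datetime import datetime

    # Hypothetical structured log records: (request_id, service, ISO-8601 timestamp).
    records = [
        ("req-1", "gateway",  "2024-01-01T00:00:00.000"),
        ("req-1", "orders",   "2024-01-01T00:00:00.015"),
        ("req-1", "database", "2024-01-01T00:00:00.950"),  # the long gap is before this hop
        ("req-2", "gateway",  "2024-01-01T00:00:01.000"),
        ("req-2", "orders",   "2024-01-01T00:00:01.012"),
        ("req-2", "database", "2024-01-01T00:00:01.030"),
    ]

    by_request = defaultdict(list)
    for request_id, service, ts in records:
        by_request[request_id].append((service, datetime.fromisoformat(ts)))

    for request_id, hops in by_request.items():
        hops.sort(key=lambda hop: hop[1])  # order each request's hops by time
        print(request_id)
        for (prev_service, prev_ts), (service, ts) in zip(hops, hops[1:]):
            gap_ms = (ts - prev_ts).total_seconds() * 1000
            print(f"  {prev_service} -> {service}: {gap_ms:.0f} ms")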

Observability

Given that visibility is so critical to reducing tail latencies, observability is worth mentioning in this blog post. Some of the most important locations to gather data are the ingress and egress points of your service. Creating wrappers around your client and server libraries provides key locations in your codebase to emit metrics:
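Here is a minimal sketch of the wrapper idea on the egress side, assuming a hypothetical in-memory metrics sink (in practice this would be your StatsD, Prometheus, or CloudWatch client); the same pattern applies as middleware on the ingress side:

    import time
    from collections import defaultdict

    # Stand-in for a real metrics client; only here to keep the sketch self-contained.
    latency_samples_ms = defaultdict(list)

    class InstrumentedClient:
        # Wraps any client object and records the latency of every outbound call.
        def __init__(self, name, client):
            self._name = name
            self._client = client

        def call(self, method, *args, **kwargs):
            start = time.perf_counter()
            try:
                return getattr(self._client, method)(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                latency_samples_ms[f"{self._name}.{method}"].append(elapsed_ms)

    # Usage (orders_db_client is hypothetical):
    #   orders_db = InstrumentedClient("orders_db", orders_db_client)
    #   orders_db.call("get_order", order_id=42)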

Conclusion and Cost

Low tail latencies are a luxury. They suggest some level of overcapacity, given that one of the main causes of long tail latencies is acute congestion. The classic 80/20 Pareto Principle applies here just as it does anywhere: optimizing for that P99 is expensive. You’re making architectural or capacity adjustments simply to improve 1% of your workload, which suggests relatively more marginal cost than gain. Exact numbers are impossible to produce and are highly use-case dependent.

To be updated with a closer look at cases that cause acute congestion and how to produce metrics and diagnose problems.


Author

Sam Malayek

Sam Malayek works in Vancouver, using this space to fill in a few gaps. Opinions are his own.