Retries (Retry Patterns)
September 17, 2023
Overview
Retries are exactly what they sound like: retrying an operation after a failure. They are used predominantly for
I/O operations and are intrinsic to distributed computing, where network communication is inherently unreliable. However,
there are other, less common use cases as well. One example is a multitasking task scheduler in a memory-constrained environment: if a
higher-priority task preempts a lower-priority one and uses up its memory, the lower-priority task might need to be retried
once memory is freed up.
Below are the most important concepts to learn as an introduction to Retries:
Exponential Back-off
Predominantly in Distributed Systems
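Retrying a request or task immediately after a failure can compound the problem if the cause is some form of resource contention. Backing off between retries gives the neighboring service time to complete its existing tasks so that resources become free for new ones. With exponential back-off, the wait grows by a constant factor (commonly doubling) after each failed attempt, so a caller backs off further the longer the failure persists.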
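As a rough illustration, the delay before each retry could be computed as below. This is a minimal sketch, not a definitive implementation: the class name ExponentialBackoff, its parameters, and the cap on the delay are all assumptions made for the example.

```java
import java.time.Duration;

// Hypothetical helper: computes how long to wait before each retry.
final class ExponentialBackoff {
    private final Duration initialDelay; // assumed tunable, e.g. 100 ms
    private final double multiplier;     // assumed growth factor, e.g. 2.0
    private final Duration maxDelay;     // cap so the delay cannot grow without bound

    ExponentialBackoff(Duration initialDelay, double multiplier, Duration maxDelay) {
        this.initialDelay = initialDelay;
        this.multiplier = multiplier;
        this.maxDelay = maxDelay;
    }

    // retryNumber is 0 for the first retry, 1 for the second, and so on.
    Duration delayFor(int retryNumber) {
        double millis = initialDelay.toMillis() * Math.pow(multiplier, retryNumber);
        // Math.min also covers the case where the product overflows to infinity.
        return Duration.ofMillis((long) Math.min(millis, maxDelay.toMillis()));
    }
}
```

With an initial delay of 100 ms and a multiplier of 2.0, the waits would be 100 ms, 200 ms, 400 ms, 800 ms, and so on, until the cap is reached.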
Jitter
Predominantly in Distributed Systems
Jitter is used to add randomness to the exact time when the next retry takes place. This is useful in large-scale distributed systems where many instances running the same code might need to retry a request to a neighboring service at the exact same time. If all the instances attempted to make what would essentially be a synchronized barrage of requests, this could lead to resource contention for the neighboring service, resulting in dropped requests.
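One common way to add jitter is to pick a random wait between zero and the computed back-off, sometimes called "full jitter". The sketch below assumes that approach; the class and method names are hypothetical, and other variants (for example, adding a small random offset to a fixed delay) are equally valid.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper that randomizes a back-off duration.
final class Jitter {
    // Returns a uniformly random wait in the range [0, backoffMillis].
    static long fullJitter(long backoffMillis) {
        if (backoffMillis <= 0) {
            return 0;
        }
        // nextLong(bound) excludes the bound itself, hence the +1.
        return ThreadLocalRandom.current().nextLong(backoffMillis + 1);
    }
}
```

Because each instance draws its own random wait, instances that failed at the same moment spread their retries across the whole interval instead of arriving together.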
Limit Number of Retries
It's important to limit the number of retries to prevent infinite loops of retries that exhaust resources or block further processing. What happens when this limit is reached? This all depends on the use case. It might make sense to use a Dead-Letter Queue.
Limits in exponential back-offs aren't always defined as a maximum number of retries. They can also be defined as a maximum total duration for retrying. The latter is useful when the process's duration is variable and it's critical that all attempts conclude before a certain time so that another process can begin. This is common in periodic (for example, hourly) processing.
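Both kinds of limit can be enforced in the same loop. The sketch below is one possible shape, with hypothetical names throughout (BoundedRetry, maxAttempts, budget, pause). To keep the focus on the limits, it uses a fixed pause rather than an exponential one and retries on any exception.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

// Hypothetical retry loop bounded by an attempt count and a time budget.
final class BoundedRetry {
    // Calls the operation until it succeeds, maxAttempts is used up, or the
    // next pause would overrun the budget; then rethrows the last failure.
    static <T> T call(Callable<T> operation, int maxAttempts,
                      Duration budget, Duration pause) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Instant deadline = Instant.now().plus(budget);
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e; // remember the failure in case no attempt succeeds
            }
            boolean outOfAttempts = attempt == maxAttempts;
            boolean outOfTime = Instant.now().plus(pause).isAfter(deadline);
            if (outOfAttempts || outOfTime) {
                break;
            }
            Thread.sleep(pause.toMillis());
        }
        throw last; // the caller decides what happens next, e.g. a Dead-Letter Queue
    }
}
```

Note that the time check only prevents starting another pause; a single slow attempt can still run past the deadline unless the operation has its own timeout.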
Retriable Errors
Not all errors can be retried. For example, if a user submitted a bad request, retrying the same bad request will result in another error.
Whether an error can be retried is up to the designer of the system to decide, along with the mechanism for implementing retries.
A basic way to accomplish this in Java, for example, is for the retryable exception classes of a system to implement a Retryable interface.
Then, before attempting a retry, you check whether the exception is instanceof Retryable.
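A minimal sketch of the marker-interface approach described above follows. The Retryable interface comes from the text; the two exception classes and the RetryPolicy helper are hypothetical names chosen for illustration.

```java
// Marker interface: exception classes that implement it may be retried.
interface Retryable {}

// Hypothetical transient failure, e.g. a neighboring service is overloaded.
class ServiceUnavailableException extends RuntimeException implements Retryable {
    ServiceUnavailableException(String message) {
        super(message);
    }
}

// Hypothetical permanent failure: resending the same bad request cannot succeed.
class BadRequestException extends RuntimeException {
    BadRequestException(String message) {
        super(message);
    }
}

final class RetryPolicy {
    // Only exceptions marked as Retryable are worth another attempt.
    static boolean shouldRetry(Throwable error) {
        return error instanceof Retryable;
    }
}
```

A retry loop would call RetryPolicy.shouldRetry(e) in its catch block and rethrow immediately when it returns false.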
Dead-Letter Queues
A Dead-Letter Queue is a queue of messages (which may represent tasks) that could not be delivered or completed. How such a queue is processed depends on the use case. Some options include:
- Retrying the messages at a later time.
- Alerting human operators to decide how to handle them.
- Archiving them.
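The hand-off to a Dead-Letter Queue can be sketched as follows. This is an in-memory stand-in with hypothetical names (DeadLetteringProcessor, handler, maxAttempts); a production system would more likely rely on a durable queue than on a collection held in memory.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hypothetical processor that parks messages it cannot complete.
final class DeadLetteringProcessor<M> {
    private final Consumer<M> handler;
    private final int maxAttempts;
    private final Queue<M> deadLetters = new ArrayDeque<>();

    DeadLetteringProcessor(Consumer<M> handler, int maxAttempts) {
        this.handler = handler;
        this.maxAttempts = maxAttempts;
    }

    void process(M message) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return; // success: nothing more to do
            } catch (RuntimeException e) {
                // Failed attempt: loop around and try again.
            }
        }
        // Every attempt failed, so set the message aside instead of
        // dropping it or blocking the messages behind it.
        deadLetters.add(message);
    }

    Queue<M> deadLetters() {
        return deadLetters;
    }
}
```

Whatever later drains deadLetters() can then retry, alert an operator, or archive, as listed above.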
Monitor and Iterate
It's important to publish metrics for your retries. This helps you fine-tune their parameters, such as the initial back-off duration, the rate at which the back-off grows for subsequent retries, and the maximum number of retries. A first implementation can be good enough, but it is rarely perfect.
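What to measure depends on the system, but a few counters already go a long way. The sketch below keeps them in memory under hypothetical names (RetryMetrics and its methods); in practice they would be exported to whatever metrics system is already in use.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-memory counters for retry behavior.
final class RetryMetrics {
    private final LongAdder operations = new LongAdder(); // first attempts
    private final LongAdder retries = new LongAdder();    // second and later attempts
    private final LongAdder exhausted = new LongAdder();  // gave up after the limit

    // attemptNumber is 1 for the first attempt of an operation.
    void recordAttempt(int attemptNumber) {
        if (attemptNumber == 1) {
            operations.increment();
        } else {
            retries.increment();
        }
    }

    void recordExhausted() {
        exhausted.increment();
    }

    // Average number of retries per operation.
    double retriesPerOperation() {
        long ops = operations.sum();
        return ops == 0 ? 0.0 : (double) retries.sum() / ops;
    }

    long exhaustedCount() {
        return exhausted.sum();
    }
}
```

A rising retriesPerOperation() may suggest the initial back-off is too short, while a rising exhaustedCount() may suggest the retry limit is too low or the downstream problem is not transient.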
Idempotency
Retrying operations creates a risk of duplicating actions. This risk is mitigated with idempotency: no matter how many times an operation is repeated, attempts after the first successful one should have no further effect, but should return the same response as that first successful attempt. A common mechanism is to save Request or Session IDs in a DB table that is queried before any mutating operation begins. The Request IDs can be random, or they can be derived deterministically from the fields that distinguish one task from any other.
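The Request ID mechanism can be sketched as follows, with a ConcurrentHashMap standing in for the DB table. The names IdempotentExecutor, execute, and requestIdFor are hypothetical, and the "|" separator used to combine fields is an assumption that only works if the fields themselves never contain it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical executor that runs each request ID's mutation at most once.
final class IdempotentExecutor<R> {
    // In-memory stand-in for the DB table of request IDs and saved responses.
    private final Map<String, R> responsesByRequestId = new ConcurrentHashMap<>();

    R execute(String requestId, Supplier<R> mutation) {
        // Runs the mutation only if this ID has no saved response yet;
        // otherwise returns the saved response without repeating the work.
        // A mutation that throws (or returns null) saves nothing, so it can be retried.
        return responsesByRequestId.computeIfAbsent(requestId, id -> mutation.get());
    }

    // Derives a deterministic ID from the fields that identify a task.
    static String requestIdFor(String... fields) {
        byte[] key = String.join("|", fields).getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(key).toString();
    }
}
```

Calling execute twice with the same ID performs the mutation once and returns the first response both times, which is the behavior a retried request needs.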