Rate-limiting is a concept that exists all around us. Most human-driven and human-served processes have some form of rate-limiting, since people and resources are finite. It also exists outside of computing in engineering, for example, with flow regulators in pipelines. Within computing, rate-limiting is most closely associated with distributed computing and I/O, but there are non-I/O contexts as well. For example, some systems limit how often an application or process can request memory within a given time frame. Within the scope of distributed computing, rate-limiting restricts either synchronous or asynchronous network requests.
Rate-limiting asynchronous requests carries less risk than rate-limiting synchronous requests, because no caller is waiting on a short timeout. Asynchronous processes typically have much longer timeouts and much larger queues to hold messages. The rate limit is usually implicit: it is set by the maximum resources allowed to consume requests from the queue.
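As a minimal sketch of that implicit limit, the snippet below drains a queue with a fixed number of worker threads. The names `drain`, `max_consumers`, and `handle` are hypothetical and chosen for illustration; a real consumer would pull from a message broker rather than an in-memory queue.

```python
import queue
import threading


def drain(messages, max_consumers, handle):
    """Process every message with at most `max_consumers` worker threads.

    There is no explicit rate limiter here: throughput is capped only by
    how many workers are allowed to pull from the queue at once.
    """
    pending = queue.Queue()
    for message in messages:
        pending.put(message)

    def worker():
        while True:
            try:
                message = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            handle(message)

    workers = [threading.Thread(target=worker) for _ in range(max_consumers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Raising or lowering `max_consumers` changes how fast the backlog shrinks, which is the only throttle in this sketch.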
Rate-limiting synchronous requests carries additional challenges compared to asynchronous requests. Servers can only hold a certain number of open connections at the same time. Furthermore, a client will only wait so long and retry so many times before giving up and returning an error.
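Because a synchronous caller cannot wait long, a server-side limiter usually has to answer immediately with "allowed" or "rejected". One common way to do that is a token bucket, sketched below. This is an illustrative technique, not something the text above prescribes, and the class name and the `rate`, `capacity`, and `clock` parameters are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate` tokens/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1):
        """Return True and spend `cost` tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request handler would call `allow()` first and reject the request straight away when it returns `False`, rather than holding the connection open.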
Aggressive retry policies can cause rate limits to be reached earlier than anticipated. It’s important to retry only if the error is retryable. Even then, a stream of retryable errors is a symptom rather than the problem; the real problem is whatever is causing those errors.
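A minimal sketch of a conservative retry policy follows. It retries only errors explicitly marked as retryable, caps the number of attempts, and backs off with jitter between them. `RetryableError`, `call_with_retries`, and the default values are hypothetical names and numbers picked for the example.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a throttled call)."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying only on RetryableError.

    Any other exception propagates on the first attempt. The wait before
    each retry is random, with an upper bound that doubles every attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise  # retry budget spent, surface the error
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Keeping `max_attempts` small matters here: every retry is another request counted against the same rate limit.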
Batching, aggregating, and reducing the granularity of data are ways to handle more data in a data pipeline without increasing its physical bandwidth. An important decision is whether rate-limiting should apply to the number of batches or to the number of items within the batches. Choices like this can have negative effects that cascade back up the pipeline if the resources for processing downstream are not sufficient.
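The difference between the two choices can be sketched with a single budget for one time window. The function name `admit` and its parameters are hypothetical, and stopping at the first batch that does not fit is an assumption made to preserve ordering.

```python
def admit(batches, limit, per_item):
    """Return the batches admitted within one window's budget of `limit`.

    With per_item=False every batch costs 1 unit regardless of its size.
    With per_item=True a batch costs as many units as it has items.
    """
    used = 0
    admitted = []
    for batch in batches:
        cost = len(batch) if per_item else 1
        if used + cost > limit:
            break  # budget exhausted; later batches wait for the next window
        used += cost
        admitted.append(batch)
    return admitted


batches = [[0] * 2, [0] * 500, [0] * 3]
by_batch = admit(batches, limit=3, per_item=False)
by_item = admit(batches, limit=100, per_item=True)
print(sum(len(b) for b in by_batch))  # items sent downstream under a batch limit
print(sum(len(b) for b in by_item))   # items sent downstream under an item limit
```

Limiting by batch count lets the 500-item batch through untouched, so the downstream stage sees far more work than the limit of 3 suggests. Limiting by item count holds it back.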
Distributed Computing “Call Stack”
In a typical program, whenever a function is invoked that, in turn, calls another function (and potentially even more nested function calls thereafter), that first function call is still open and remains on the call stack until all the nested function calls return. A similar phenomenon occurs in distributed computing as network calls flow through microservices. The edge service that receives the initial request keeps that network connection open for the entire duration of all nested network requests to neighboring services. This matters when implementing rate-limiting: consider the load on the edge service, as well as the variety, frequency, and latency of request journeys.
There’s no actual call stack here, hence the quotation marks in the heading, which indicate that the term is a misnomer. It is used because of the similarities with function call stacks noted above.
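To put rough numbers on how long an edge service holds each connection, the sketch below sums latencies along a request journey and applies Little's law (average connections in flight = arrival rate × average time held). The service names and latencies are invented for illustration, and the model assumes each service calls its downstream neighbors one after another rather than in parallel.

```python
# Hypothetical topology: service -> (own latency in ms, downstream services).
TOPOLOGY = {
    "edge": (10, ["orders"]),
    "orders": (20, ["inventory", "payments"]),
    "inventory": (30, []),
    "payments": (40, []),
}


def hold_time_ms(service, topology):
    """Time a service keeps its inbound connection open, in milliseconds.

    Assumes sequential downstream calls, so the hold time is the service's
    own latency plus the hold time of everything it calls.
    """
    own_latency, downstream = topology[service]
    return own_latency + sum(hold_time_ms(d, topology) for d in downstream)


def open_connections(requests_per_second, service, topology):
    """Little's law: average in-flight connections = rate * hold time."""
    return requests_per_second * hold_time_ms(service, topology) / 1000


print(hold_time_ms("edge", TOPOLOGY))           # ms the edge holds one request
print(open_connections(200, "edge", TOPOLOGY))  # average connections at 200 rps
```

Under these assumptions, a slowdown in any nested service raises the number of connections the edge must hold, even if the request rate stays the same.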
An alternative to server-side rate-limiting, especially useful if a system’s clients are highly tech-literate, is to provide SLAs to them and have the clients handle their own rate-limiting.
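On the client side, this can be as simple as spacing out calls so they never exceed the agreed rate. The sketch below is one way to do that; `ClientPacer` and `max_per_second` are hypothetical names, and the actual limit would come from whatever the SLA states.

```python
import time


class ClientPacer:
    """Space out calls so the client stays at or below `max_per_second`."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_per_second
        self.clock = clock  # injectable for testing
        self.sleep = sleep
        self.next_slot = clock()

    def wait_turn(self):
        """Block until the next send slot and return the seconds waited."""
        now = self.clock()
        delay = max(0.0, self.next_slot - now)
        if delay:
            self.sleep(delay)
        # Reserve the slot after this one, measured from whichever is later.
        self.next_slot = max(now, self.next_slot) + self.interval
        return delay
```

A client would call `wait_turn()` immediately before each request. This sketch is not thread-safe, so a multi-threaded client would need a lock around `wait_turn()`.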
Rate-limiting is typically introduced only once a system reaches some level of maturity, given that traffic is required to make use of it.
Sam Malayek works in Vancouver for Amazon Web Services, and uses this space to fill in a few gaps. Opinions are his own.