Event-Driven Architecture
Event-driven architecture describes systems at the scope of a cluster of microservices. Services within an event-driven architecture are typically either producers or consumers and there is typically a message broker between them to store these messages so that consumers can scale reactively without risk of losing messages.
Event-driven systems typically excel at reliability and scalability at the expense of some latency. In fact, all distributed systems trade some latency for modularity: communication between machines in the same data center (and even between virtual machines on the same host) typically incurs a cost on the order of hundreds of microseconds to milliseconds per hop.
There exists tremendous subtlety between:
- Asynchronous event-driven producers that do not use message brokers
- Synchronous event-driven producers that do not use message brokers
- Synchronous request/response clients
There are trade-offs and exceptions to the generalizations of event-driven architecture mentioned above, and these rarer manifestations are where they appear. This post will clarify these subtle distinctions.
Event-driven systems are primarily asynchronous: messages are sent in a non-blocking manner, and any response is not used for further processing.
Pub/Sub with Message Broker
In larger systems, this asynchronous communication is usually complemented by a message broker that sits between producers and consumers. Message brokers provide durability, reliability, and decoupling.
- Durability: Messages are stored on the message broker for a configurable retention period (often days to weeks — for example, Kafka’s default retention is 7 days, and SQS retains messages for up to 14 days), as brokers typically use persistent storage rather than memory.
- Reliability: Entire scaling groups of consumers can be lost for the duration of message storage in the broker without losing a single message. Reliability is a product of durability.
- Decoupling: Producers don’t need to know about consumers, their location or capacity.
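As a minimal in-process sketch of this decoupling (all names here are hypothetical, and a real broker would live on separate machines), a bounded queue can stand in for the broker: the producer hands messages to it without knowing anything about the consumers, and consumers drain it at their own pace.

```python
import queue
import threading

# Hypothetical in-process "broker": a bounded queue that buffers
# messages so consumers can drain them at their own pace.
broker = queue.Queue(maxsize=100)

def producer(events):
    # The producer only knows about the broker, not the consumers.
    for event in events:
        broker.put(event)

consumed = []

def consumer():
    # The consumer pulls from the broker whenever it has capacity.
    while True:
        event = broker.get()
        if event is None:  # sentinel used here to stop the loop
            break
        consumed.append(event)

t = threading.Thread(target=consumer)
t.start()
producer(["sensor_reading_1", "sensor_reading_2"])
broker.put(None)
t.join()
print(consumed)  # ['sensor_reading_1', 'sensor_reading_2']
```

If the consumer thread dies, the messages simply wait in the queue — a toy version of the durability and reliability properties above.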
The use cases for message brokers like Apache Kafka or Amazon SQS are very broad. Any system that needs to scale reactively to high throughput while guaranteeing message delivery can benefit from these technologies (learn more at Throughput vs Latency), especially if it can afford slightly higher latency. How much higher depends entirely on the architecture of the system as a whole. Benchmarks exist that point to Kafka’s ability to keep latencies low despite high throughput (such as the blog post Benchmarking Apache Pulsar, Kafka, and RabbitMQ). However, these benchmarks are highly synthetic, and real-world performance can only match those figures in a system architected with the same level of simplicity. That may hold true at the very start of a project, but as a system expands to hundreds of microservices, each hop between event-driven services passes through this message broker and adds to the overall latency of a single request’s journey through the system.
Pub/Sub without Message Broker
In smaller systems, implementing Pub/Sub without a message broker can be challenging, but could be the right choice. The increased difficulty comes from the fact that producers cannot produce messages indiscriminately — they must be aware of the capacity of the consumers, and of their location. This may appear simple on the surface, but maintaining that list, and updating it as consumer groups are added and removed, adds overhead. The overhead would be greater still if consumer groups lacked load balancers to track the individual consumers within them (and their inevitable failures over time).
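A hedged sketch of what this looks like from the producer’s side: without a broker, the producer itself must hold a registry of consumer endpoints (the addresses below are purely illustrative) and spread load across them.

```python
import itertools

# Hypothetical sketch: without a broker, the producer must track
# consumer locations itself and spread load across them. In a real
# system this registry would be updated as consumers join and leave.
consumer_endpoints = ["10.0.0.1:9000", "10.0.0.2:9000"]  # assumed addresses
round_robin = itertools.cycle(consumer_endpoints)

def dispatch(message):
    # Pick the next consumer in round-robin order; a real dispatch
    # would be a network send rather than returning a tuple.
    endpoint = next(round_robin)
    return endpoint, message

sent = [dispatch(m) for m in ["e1", "e2", "e3"]]
print(sent)
# [('10.0.0.1:9000', 'e1'), ('10.0.0.2:9000', 'e2'), ('10.0.0.1:9000', 'e3')]
```

The registry and the balancing policy are exactly the overhead the paragraph above describes: they must be kept correct as consumers come and go.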
Asynchronous event-driven producers that do not use message brokers do not wait for a response before considering a task complete and moving on to the next. This means they do not guarantee at-least-once delivery. This categorization does not depend on the producer’s use of an asynchronous library; it depends on the producer considering the task complete immediately after sending the message to the consumer(s), rather than waiting for a response or confirmation of receipt. This mechanism is becoming less common over time as hardware becomes cheaper and smaller, making message brokers an easier choice.
This architecture is often backed by UDP.
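A minimal sketch of such a fire-and-forget producer over UDP (the payload and ports here are illustrative; a receiver is included only so the example is self-contained):

```python
import socket

# Receiving side (the "consumer"), bound to an OS-assigned free port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
addr = receiver.getsockname()

# Producing side: sendto() returns as soon as the datagram is handed
# to the kernel. The producer's task is "complete" at this point --
# there is no acknowledgement and no delivery guarantee.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"telemetry:42", addr)

data, _ = receiver.recvfrom(1024)  # the consumer, if it happens to be up
print(data)  # b'telemetry:42'
sender.close()
receiver.close()
```

Over loopback this datagram arrives, but over a lossy network it may simply vanish — which is precisely the trade-off being made.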
The use case for this uncommon setup is in smaller systems where resources are scarce. Embedded systems and IoT devices fall under this category. For example, imagine a network of N sensors in a home that produce telemetry to M base devices. The overhead required to store messages persistently in a message broker could make the base devices prohibitively expensive, or require a physical increase in size that would lower their marketability.
Another use case is in scenarios where latency is a greater concern; removing the broker removes a hop and its overhead. However, the greater the emphasis on low latency, the more co-location is required between the nodes of the network. At the extreme of this spectrum, distributed systems are avoided altogether and any machine that requires external communication is co-located with its counterpart, as is common in High-Frequency Trading.
Synchronous Event-Driven
Synchronous communication in event-driven services is rare. It is very similar to synchronous request/response architecture, with an important distinction: in synchronous event-driven mechanisms, producers send a request and wait for a response, but don’t continue with any further processing. The task is complete once a response is received.
There’s an important distinction between synchronous event-driven systems and asynchronous event-driven systems that use message brokers. Producers in asynchronous event-driven systems that use message brokers may wait for a response from the message broker before considering the task complete; however, because they don’t wait for acknowledgement from the consumers, the flow is considered asynchronous.
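A hedged sketch of one synchronous event-driven hop (a socket pair stands in for a real network connection): the producer blocks until the consumer acknowledges, then its task ends, with the acknowledgement driving no further processing.

```python
import socket
import threading

# Hypothetical sketch of a synchronous event-driven hop: the producer
# waits for an acknowledgement, then stops -- no further processing.
producer_sock, consumer_sock = socket.socketpair()

def consumer():
    _event = consumer_sock.recv(1024)
    # Acknowledge receipt so the producer can close out the task.
    consumer_sock.sendall(b"ACK")

t = threading.Thread(target=consumer)
t.start()

producer_sock.sendall(b"order_created")
ack = producer_sock.recv(1024)  # block until the consumer confirms
t.join()
# The producer's task ends here; the ACK is not used for further work.
print(ack)  # b'ACK'
producer_sock.close()
consumer_sock.close()
```

This is what gives the pattern its delivery guarantee without a broker: the producer knows the message arrived before moving on.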
Another important distinction exists between synchronous event-driven systems and request/response. In request/response, subsequent requests to other services in a network occur while the previous request remains open. This chain of open requests can span several services in large distributed systems until the final request completes and the chain unwinds. A critical consideration is that if all services’ request timeouts are equal, the first request’s timeout is the only one that matters. In synchronous event-driven systems, by contrast, requests are passed and closed from one service to the next.
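The timeout point can be made concrete with a toy model (the services and time "units" below are invented for illustration): each hop in a request/response chain stays open while its downstream calls run, so the budget set by the first caller bounds the entire chain.

```python
# Toy model: each service spends some "time units" of its caller's
# remaining budget before calling downstream. If every service has
# the same timeout, only the first caller's budget ever binds.

def call_with_timeout(fn, budget):
    # Simulate a timeout: refuse the call once the budget is exhausted.
    if budget <= 0:
        raise TimeoutError("upstream timeout already exhausted")
    return fn(budget)

def service_c(budget):
    return "result"

def service_b(budget):
    # service_b spends 4 units of work before calling service_c.
    return call_with_timeout(service_c, budget - 4)

def service_a(budget):
    # service_a spends 4 units of work before calling service_b.
    return call_with_timeout(service_b, budget - 4)

# With a 5-unit budget at the first hop, the chain times out before
# service_c is ever reached.
try:
    outcome = service_a(5)
except TimeoutError as exc:
    outcome = str(exc)
print(outcome)  # upstream timeout already exhausted
```

In a synchronous event-driven chain there is no such nesting: each hop closes before the next opens, so each hop’s timeout applies only to itself.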
Use cases for synchronous event-driven systems are when:
- the resource overhead of a message broker cannot be afforded, and
- delivery of messages needs to be guaranteed, and
- further processing is not required once a response is received
Examples of this are embedded systems, IoT (where each data point being transmitted must be recorded), and robotics (where a robotic arm might use a single serial communication bus to pass events through each joint that contains a dual-core controller alongside its actuator).
The main distinguishing attribute of Event-Driven Systems is that each service’s task is complete as soon as it hands off a message to the next service. It does not wait for a response and then continue processing based on what it receives (or doesn’t receive); that pattern is Request/Response architecture, which itself has many sub-categories. However, this is not to say that an Event-Driven System can’t have some elements of Request/Response. Different architectures complementing each other in a system is quite common.
Updated: 2023-12-29. More updates expected with table, diagrams, and dynamic priority matrix.
Sam Malayek works in Vancouver, using this space to fill in a few gaps. Opinions are his own.