What are Distributed Systems?
March 05, 2024
Overview
Distributed Systems are synonymous with microservices. They are networks of containers, virtual machines, or physical machines. As time has passed, it's more common to build applications in a way where its individual components are separate services that communicate with each other over a network, rather than as individual components within a monolith application. This network may be physical, or it may be virtual. Kubernetes is a framework that makes building a highly scalable, configurable, and efficient microservices application relatively easier than it otherwise would (at the scale and granularity offered by Kubernetes).
Distributed Systems can be observed in more than just modern applications. The internet is a Distributed System.
This blog post will provide an overview of the considerations necessary when building and operating Distributed Systems.
Table of Contents
Data Store Design
Choosing a data store for a Distributed System can be challenging. More services means more connections to databases. This can pose a challenge if there's only one data store. Even if
queries are handled through a connection pool to reduce the total number of connections, monolith data stores can easily become
a scaling bottleneck to larger Distributed Systems. If these queries have a large number of mutations (CREATE
, UPDATE
, DELETE
, INSERT
), then the chance of transactions
conflicting is higher. In these cases, it's up to the client (the service making the database mutation) to try again with the same
or different parameters. If these errors accumulate, it can easily have cascading negative performance impacts throughout the entire
cluster of services.
Potential Solutions to a Monolithic Data Store
Service-Specific SQL or NoSQL Data Stores + Saga Pattern
This solution resolves the problem of having a high number of connections to a single database, as well as the higher chance of database transaction failures caused by conflicting mutations. However, it doesn't eliminate this risk of increased transaction failures. This burden is offloaded to the application that is forced to handle relational mutations through a pattern such as the Saga Pattern. It's important to note that the Saga Pattern has significant overhead that potentially requires the addition of at least a whole additional service, along with a set of libraries to help coordinate all of this application-level transaction handling. The benefit is that this provides application-level control that can help to optimize how transactions are handled for the application's use cases.
Service-Specific SQL or NoSQL Data Stores + Saga Pattern
If complex relational constraints do not exist between services and their entities, it's simpler to treat each service and its data as its own entity. In this model, if a service needs information from another service, it simply sends a request to this other service's API. It may keep a cached version locally to lower latency of requests to its service.
Distributed SQL Data Store
A relatively new potential solution to the problem of a monolithic database for a Distributed System is a distributed SQL database like CockroachDB. CockroachDB can replicate both read and write nodes across the entire globe while maintaining consistency and handling transactions. The only downside is that a greater distribution of the database globally necessarily increases the latency of each transaction, thereby increasing the chance that it will conflict with another transaction initiated on the other side of the globe.
Granularity
From monolith to nanoservices, the choice for how small or large each service should be is explored in Microservices vs Monolithic.
Inter-Service Communication
The choice for communication between services is highly use-case dependant. It's very common for different groups of services within a Distributed System to have distinctly different communication patterns. Rather than exploring all the possible communication patterns, the choice needs to start with the customer and the use case for the feature. It's perfectly normal to build entirely unique communication mechanisms for a single feature. However, it's important to build it in a way so that this newly added communication mechanism can be scaled to multiple features. Below are some blog posts that touch on various communication patterns:
Retries
Retries are a critical implementation detail in a Distributed System. Network communication inherently experiences transient failures that may or may not depend on the actual traffic being experienced by either the client or server.
Debugging
Debugging Distributed Systems is nothing like debugging a monolithic system. It's much more difficult. During development, you don't have the luxury of a using an actual debugger to step into or step over individual operations in code (outside unit tests). Logging is your best friend when debugging, alongside other observability tools. Tracing is a nice-to-have that becomes difficult to manage when there's a lot of batching.
Testing
See Testing in Software Engineering to learn more about testing that most certainly affects Distributed Systems.
Ownership and Codebase
Maintaining ownership among developers is critical to maintaining high code quality. This means that services need to have owners that are required to approve new changes to that section of the codebase that belongs to that service. This assumes a monorepo, otherwise, each service would have a separate codebase and its own ownership.
Conclusion
The choice for building a distributed system is a common one today if the goals are:
- Flexible scaling.
- Fault tolerance and high availability.
- High throughput.
On the other hand, if low latency is the most important concern (as might be the case in HFT), then it's always more advisable to minimize network requests and build something monolithic.
To be updated with a section on deployments, and a closer look at the difficulties of implementing tracing with batching.
Updated: 2024-03-07