A distributed system is a system whose components (computers, servers, etc.) are spread across a network and whose data is stored on multiple nodes to provide a service.
The data system designer not only deals with application code but also combines several other tools to build the foundations for the service: caching, database, full-text search, etc.
Some of the challenges associated with distributed systems include network latency, partial failures, and concurrency of the components.
- How can we guarantee data consistency across multiple nodes?
- What should happen when one of the system's components fails?
Consistency, Availability, and Partition Tolerance
The CAP theorem helps data system designers choose which guarantees of the system to prioritize.
Consistency means that all clients see the same data at the same time, no matter which node they are connected to. For this to happen, whenever data is written to one node, it must be replicated to all the other nodes in the system before the write is considered successful.
Availability means that any client requesting data gets a response, even if one or more components are down.
Partition tolerance means that the cluster must continue to work despite communication breakdowns between nodes in the system.
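One common way to reason about this trade-off is quorum replication: with N replicas, a write waits for W acknowledgements and a read consults R replicas; choosing R + W > N guarantees that every read overlaps the most recent write. A minimal, database-agnostic sketch follows (all names and numbers are illustrative, not from any particular system):

```python
# Minimal quorum sketch over in-memory "replicas"; not a real datastore.
N, W, R = 3, 2, 2                      # R + W > N, so reads overlap the latest write

replicas = [{} for _ in range(N)]      # each replica maps key -> (version, value)

def write(key, value, version):
    acks = 0
    for replica in replicas:
        replica[key] = (version, value)
        acks += 1
        if acks >= W:                  # stop once a write quorum has acknowledged
            return True
    return False

def read(key):
    # ask R replicas and keep the value with the highest version number
    answers = [replica.get(key, (0, None)) for replica in replicas[:R]]
    return max(answers, key=lambda pair: pair[0])[1]

write("greeting", "hello", version=1)
print(read("greeting"))                # -> hello
```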
MongoDB is a CP data store. It resolves network partitions by maintaining consistency while compromising availability.
Writes can only be made to the primary node. When the primary node goes down, the secondary with the most recent log data is elected as the new primary. Once all the secondary nodes catch up, the cluster becomes fully available again.
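As a rough illustration of how this consistency-first behavior surfaces in application code, a "majority" write concern in the Python driver makes a write succeed only after most replica set members acknowledge it. This is a hedged sketch assuming pymongo and a local replica set; the database, collection, and document names are hypothetical:

```python
# Sketch assuming a MongoDB replica set at localhost and pymongo installed;
# "shop" and "orders" are hypothetical names.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# w="majority": the write returns only after a majority of replica set
# members acknowledge it, prioritizing consistency over availability.
orders = client["shop"].get_collection(
    "orders", write_concern=WriteConcern(w="majority")
)
orders.insert_one({"item": "book", "qty": 1})
```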
Cassandra is an AP database. It provides availability and partition tolerance at the cost of strong data consistency.
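Conversely, Cassandra exposes the trade-off per query through tunable consistency levels. A sketch assuming the DataStax cassandra-driver and a local cluster (keyspace, table, and column names are hypothetical):

```python
# Sketch assuming cassandra-driver and a local Cassandra node;
# the "shop" keyspace and "orders" table are hypothetical.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")

# ConsistencyLevel.ONE: a single replica's acknowledgement is enough,
# favoring availability; stronger levels such as QUORUM trade some
# availability back for consistency.
insert = SimpleStatement(
    "INSERT INTO orders (id, item) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(insert, (1, "book"))
```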
Reliable, Scalable, and Maintainable Applications
In general, microservices and distributed system architecture are more complex to develop and deploy than monolithic architecture. A system that can run on a single machine is often simpler, but high-end machines can become very expensive.
The advantage over a monolith is organizational scalability through loose coupling: different teams can work on different services. Whatever the architecture, the goal is a system that is reliable, scalable, and maintainable.
Reliability
A reliable system should continue to work correctly even when things go wrong. The things that can go wrong are called faults. A system that anticipates faults and can cope with them is called fault-tolerant. Typical examples are hardware failure, bad inputs, a service the system depends on slowing down, etc.
A fault is usually defined as one system component deviating from its spec, whereas a failure is when the system stops providing the required service to the user.
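For example, a client can often keep a transient fault in a dependency from turning into a failure for the user by retrying with a timeout and exponential backoff instead of giving up immediately. A generic sketch, not taken from any particular library:

```python
import time

# Hypothetical sketch: retry a flaky operation with exponential backoff so a
# transient fault in a dependency does not become a failure for the user.
def call_with_retries(operation, max_attempts=4, base_delay=0.1):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise                                      # give up: the fault becomes a failure
            time.sleep(base_delay * 2 ** (attempt - 1))    # back off, then retry
```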
Scalability
A scalable system can handle an increased load by adding more resources, partitioning data and processing across multiple nodes, and having the ability to replicate data to ensure fault tolerance.
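A simple illustration of partitioning is hash partitioning, where a key's hash determines which node owns it. This is a minimal sketch with hypothetical node names; production systems usually layer consistent hashing and replication on top:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical node names

def node_for(key: str) -> str:
    # hash the key and map it onto one of the nodes
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for("user:42"))   # each node ends up with roughly 1/len(NODES) of the keys
```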
Maintainability
A maintainable system applies the three design principles: Operability, Simplicity, and Evolvability.
- Operability: makes routine tasks easy for operations teams;
- Simplicity: makes the system easy for new engineers to understand, which eases onboarding;
- Evolvability: engineers feel confident making changes to the system to adapt to unanticipated use cases as requirements change.
Closing thoughts
Working with a distributed system requires thinking about data modeling, data storage, and processing. The data system designer must also consider the nature of the product, as a data model optimized for write-heavy loads might not be appropriate for read-heavy loads. The CAP theorem also plays a considerable part when selecting data storage. Having this knowledge in your utility belt will help you make better decisions in the future.