
Intro to Distributed Systems

· 4 min read
Fernando Ramirez
Software Engineer @ Microsoft

High-level topics in distributed systems. Notes gathered from MIT 6.824: Distributed Systems, Lecture 1: Introduction.

What is a Distributed System?

A set of cooperating computers that communicate with each other over a network to get some coherent task done.

Why use Distributed Systems

  • Parallelism - To achieve high performance by running work on many computers at once
  • Fault tolerance - To keep the system working even when individual machines fail
  • Physical - Some systems are inherently physically distributed, e.g. the computers of different banks transferring money between them
  • Security - Splitting work across isolated computers that communicate only over the network limits the damage a compromised component can do
note

If you can solve a problem on a single computer without building a distributed system, you should do it that way.

Important Infrastructure in Distributed Systems

  • Storage, e.g. databases
  • Communication, e.g. RPC
  • Computations, e.g. MapReduce

We'd like to build abstractions that hide the distributed nature of these systems.
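
To make the communication piece concrete, here is a minimal sketch of the RPC style using Go's standard net/rpc package. The KVStore type and its method names are my own invention for illustration, not from the lecture:

```go
package main

import (
	"log"
	"net"
	"net/rpc"
	"sync"
)

// Args and Reply are hypothetical request/response types for a toy
// key-value service exposed over RPC.
type Args struct {
	Key, Value string
}

type Reply struct {
	Value string
}

// KVStore is the RPC-exposed service. net/rpc requires methods of the
// form: func (t *T) Method(args A, reply *R) error.
type KVStore struct {
	mu   sync.Mutex
	data map[string]string
}

func (kv *KVStore) Put(args Args, reply *Reply) error {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	kv.data[args.Key] = args.Value
	return nil
}

func (kv *KVStore) Get(args Args, reply *Reply) error {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	reply.Value = kv.data[args.Key]
	return nil
}

func main() {
	kv := &KVStore{data: make(map[string]string)}
	rpc.Register(kv)

	ln, err := net.Listen("tcp", ":1234")
	if err != nil {
		log.Fatal(err)
	}
	// Serve connections; a client invokes methods by name, e.g. "KVStore.Put".
	rpc.Accept(ln)
}
```

A client would connect with rpc.Dial("tcp", addr) and invoke client.Call("KVStore.Put", args, &reply); the point of the abstraction is that the call looks almost like a local function call even though it crosses the network.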

Scalability

Usually, the high-level goal of building distributed systems is to get what people refer to as scalable speed-up.

Example: 2x web servers yield 2x throughput

This is what we want: we can simply buy more web servers and get more throughput. The alternative, paying programmers to make your existing code faster, is more expensive.

Types of scalability

  • Vertical scaling: Adding resources to a single server to increase its processing power.
  • Horizontal scaling: Adding more machines or computing nodes to distribute the workload.
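
To picture horizontal scaling, here is a minimal sketch of a round-robin load balancer built with Go's standard library, spreading incoming requests across several web servers. The backend addresses are hypothetical; adding more of them is exactly the "buy more web servers" move described above:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	// Hypothetical backend web servers; horizontal scaling means
	// appending more addresses to this list.
	backends := []string{
		"http://localhost:9001",
		"http://localhost:9002",
	}

	var proxies []*httputil.ReverseProxy
	for _, b := range backends {
		u, err := url.Parse(b)
		if err != nil {
			log.Fatal(err)
		}
		proxies = append(proxies, httputil.NewSingleHostReverseProxy(u))
	}

	// Round-robin: request i goes to backend i mod len(backends).
	var next uint64
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		i := atomic.AddUint64(&next, 1)
		proxies[i%uint64(len(proxies))].ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```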

Fault Tolerance

A single computer can often stay up for multiple years without crashing. However, if you're building systems out of 1,000s of computers, then some of your computers will almost certainly be crashed at any given time.

"Big scale turns problems from very rare events you really don't have to worry about that much into constant problems".

Types of fault tolerance

  • Availability: Any client making a request for data gets a response, even if one or more nodes are down. A good available system will be recoverable as well.
  • Recoverability: The ability of the system to manage and understand failures while minimizing disruption to business operations.

Helpful tools for building fault-tolerant systems

  • Non-volatile storage: An important tool for building fault-tolerant systems is non-volatile storage (hard drives, SSDs). If the system crashes or loses power, you can persist a checkpoint or a log that tracks the state of the system; once power is restored, you can use the last checkpoint to continue where you left off (see the sketch after this list).
note

Non-volatile storage tends to be expensive to update, so it's best to avoid it when possible.

  • Replication: Managing replicated copies is tricky. We can have two servers, each with a supposedly identical copy of the system state; the key problem is that these two servers may drift out of sync and stop being true replicas. This is the central problem in any replicated system.
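
As a minimal sketch of the non-volatile storage idea (the file name and log format are my own, not from the lecture), the snippet below appends each state change to an on-disk log and replays the log to rebuild state after a restart:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

const logPath = "state.log" // hypothetical log file

// appendEntry durably records one state change before it is applied.
func appendEntry(key, value string) error {
	f, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := fmt.Fprintf(f, "%s=%s\n", key, value); err != nil {
		return err
	}
	// Sync forces the entry onto non-volatile storage; this is the
	// expensive update the note above warns about.
	return f.Sync()
}

// recoverState rebuilds in-memory state by replaying the log after a crash.
func recoverState() (map[string]string, error) {
	state := make(map[string]string)
	f, err := os.Open(logPath)
	if os.IsNotExist(err) {
		return state, nil // fresh start, nothing to replay
	}
	if err != nil {
		return nil, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), "=", 2)
		if len(parts) == 2 {
			state[parts[0]] = parts[1]
		}
	}
	return state, sc.Err()
}

func main() {
	if err := appendEntry("1", "20"); err != nil {
		log.Fatal(err)
	}
	state, err := recoverState()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(state) // map[1:20]
}
```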

Consistency

Example

Suppose you have a replicated key-value storage service (two replicas in total). This service supports two operations:

  • put(key, value)
  • get(key)

Given that this service is replicated, we have two servers, each with its own table:

Server 1 contains the following table:

key   value
1     20

Server 2 contains the following table:

key   value
1     20

The problem arises when a client issues a put operation to update the value of key 1 to 21.

Server 1 table:

key   value
1     21

But before we can update server 2, our system crashes, leaving the two replicas with inconsistent values.
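
To make that failure window concrete, here is a toy sketch (my own construction, not from the lecture) of a put that reaches server 1 but crashes before it reaches server 2:

```go
package main

import "fmt"

// replica is a toy in-memory key-value table.
type replica map[int]int

// put applies an update to each replica in turn; crashAfter simulates
// the process dying partway through propagation.
func put(replicas []replica, key, value, crashAfter int) {
	for i, r := range replicas {
		if i == crashAfter {
			fmt.Println("crash! update never reaches the remaining replicas")
			return
		}
		r[key] = value
	}
}

func main() {
	s1 := replica{1: 20}
	s2 := replica{1: 20}

	// Update key 1 to 21, but crash after writing only the first replica.
	put([]replica{s1, s2}, 1, 21, 1)

	fmt.Println("server 1:", s1) // server 1: map[1:21]
	fmt.Println("server 2:", s2) // server 2: map[1:20] -- out of sync
}
```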

Types of consistency

  • Strong consistency: Using the same key-value storage example as before, strong consistency would mean that any get operation returns the value written by the most recently completed put operation.
  • Weak consistency: It is often useful to accept a weaker consistency guarantee that does not promise the above. The reason is that strong consistency is a VERY expensive spec to implement.
note

There is a lot of academic and real-world research on how to structure weakly consistent systems so that they're actually useful to applications.
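
One simplified way to picture the contract (a toy model of my own, not a real protocol such as Raft or quorum reads): a strongly consistent get must consult something that has seen every completed put, while a weakly consistent get may answer from a possibly stale local copy:

```go
package main

import "fmt"

// A toy model: the primary sees every completed put; the local replica
// may lag behind. This only illustrates the contract, not how real
// systems enforce it.
type store struct {
	primary map[int]int // always up to date in this toy model
	local   map[int]int // possibly stale copy on this server
}

// strongGet routes the read to the primary, so it returns the value of
// the most recently completed put -- at the cost of a remote round trip.
func (s *store) strongGet(key int) int {
	return s.primary[key]
}

// weakGet answers from the local copy: cheap, but possibly stale.
func (s *store) weakGet(key int) int {
	return s.local[key]
}

func main() {
	s := &store{
		primary: map[int]int{1: 21}, // put(1, 21) has completed
		local:   map[int]int{1: 20}, // this replica hasn't heard yet
	}
	fmt.Println("strong:", s.strongGet(1)) // strong: 21
	fmt.Println("weak:  ", s.weakGet(1))   // weak:   20
}
```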


Distributed systems are cool 😎