
Intro to Distributed Systems

· 4 min read
Fernando Ramirez
Software Engineer @ Microsoft

High-level topics in distributed systems. Notes gathered from MIT 6.824: Distributed Systems, Lecture 1: Introduction.

What is a Distributed System?

A set of cooperating computers that communicate with each other over a network to get some coherent task done.

Why use Distributed Systems

  • Parallelism - To achieve high performance by running work on many computers at once
  • Fault tolerance - To keep the system working even when individual machines fail
  • Physical - Some systems are inherently physically distributed, e.g. the computers of different banks transferring money between them
  • Security - Splitting work across isolated computers that communicate only over the network limits the damage a compromised component can do
note

If you can solve a problem on a single computer without building a distributed system, you should do it that way.

Important Infrastructure in Distributed Systems

  • Storage, e.g. databases
  • Communication, e.g. RPC
  • Computations, e.g. MapReduce

We'd like to build abstractions that hide the distributed nature of these systems.
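
To make the communication piece concrete, here is a minimal sketch of the RPC style using Go's standard net/rpc package. The KVStore type and its method names are my own invention for illustration, not from the lecture:

```go
package main

import (
	"log"
	"net"
	"net/rpc"
	"sync"
)

// Args and Reply are hypothetical request/response types for a toy
// key-value service exposed over RPC.
type Args struct {
	Key, Value string
}

type Reply struct {
	Value string
}

// KVStore is the RPC-exposed service. net/rpc requires methods of the
// form: func (t *T) Method(args A, reply *R) error.
type KVStore struct {
	mu   sync.Mutex
	data map[string]string
}

func (kv *KVStore) Put(args Args, reply *Reply) error {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	kv.data[args.Key] = args.Value
	return nil
}

func (kv *KVStore) Get(args Args, reply *Reply) error {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	reply.Value = kv.data[args.Key]
	return nil
}

func main() {
	kv := &KVStore{data: make(map[string]string)}
	rpc.Register(kv)

	ln, err := net.Listen("tcp", ":1234")
	if err != nil {
		log.Fatal(err)
	}
	// Serve connections; a client invokes methods by name, e.g. "KVStore.Put".
	rpc.Accept(ln)
}
```

A client would connect with rpc.Dial("tcp", addr) and invoke client.Call("KVStore.Put", args, &reply); the point of the abstraction is that the call looks almost like a local function call even though it crosses the network.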

Scalability

Usually, the high-level goal of building distributed systems is to get what people refer to as scalable speed-up.

Example: 2x web servers yield 2x throughput

This is what we want: we can simply buy more web servers and get more throughput. The alternative, paying programmers to make your existing code faster, is more expensive.

Types of scalability

  • Vertical scaling: Adding resources to a single server to increase its processing power.
  • Horizontal scaling: Adding more machines or computing nodes to distribute the workload.
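
To picture horizontal scaling, here is a minimal sketch of a round-robin load balancer built with Go's standard library, spreading incoming requests across several web servers. The backend addresses are hypothetical; adding more of them is exactly the "buy more web servers" move described above:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

func main() {
	// Hypothetical backend web servers; horizontal scaling means
	// appending more addresses to this list.
	backends := []string{
		"http://localhost:9001",
		"http://localhost:9002",
	}

	var proxies []*httputil.ReverseProxy
	for _, b := range backends {
		u, err := url.Parse(b)
		if err != nil {
			log.Fatal(err)
		}
		proxies = append(proxies, httputil.NewSingleHostReverseProxy(u))
	}

	// Round-robin: request i goes to backend i mod len(backends).
	var next uint64
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		i := atomic.AddUint64(&next, 1)
		proxies[i%uint64(len(proxies))].ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```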

Fault Tolerance

A single computer can often stay up for multiple years without crashing. However, if you're building systems out of 1,000s of computers, then some of your computers will almost certainly be crashed at any given time.

"Big scale turns problems from very rare events you really don't have to worry about that much into constant problems".

Types of fault tolerance

  • Availability: Any client making a request for data gets a response, even if one or more nodes are down. A good available system will be recoverable as well.
  • Recoverability: The ability of the system to manage and understand failures while minimizing disruption to business operations.

Helpful tools for building fault-tolerant systems

  • Non-volatile storage: An important tool for building fault-tolerant systems is non-volatile storage (hard drives, SSDs). If the system crashes or loses power, you can persist a checkpoint or a log that tracks the state of the system; once power is restored, you can use the last checkpoint to continue where you left off (see the sketch after this list).
note

Non-volatile storage tends to be expensive to update, so it's best to avoid it when possible.

  • Replication: Managing replicated copies is tricky. We can have two servers, each with a supposedly identical copy of the system state; the key problem is that these two servers may drift out of sync and stop being true replicas. This is the central problem in any replicated system.
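
As a minimal sketch of the non-volatile storage idea (the file name and log format are my own, not from the lecture), the snippet below appends each state change to an on-disk log and replays the log to rebuild state after a restart:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

const logPath = "state.log" // hypothetical log file

// appendEntry durably records one state change before it is applied.
func appendEntry(key, value string) error {
	f, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := fmt.Fprintf(f, "%s=%s\n", key, value); err != nil {
		return err
	}
	// Sync forces the entry onto non-volatile storage; this is the
	// expensive update the note above warns about.
	return f.Sync()
}

// recoverState rebuilds in-memory state by replaying the log after a crash.
func recoverState() (map[string]string, error) {
	state := make(map[string]string)
	f, err := os.Open(logPath)
	if os.IsNotExist(err) {
		return state, nil // fresh start, nothing to replay
	}
	if err != nil {
		return nil, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), "=", 2)
		if len(parts) == 2 {
			state[parts[0]] = parts[1]
		}
	}
	return state, sc.Err()
}

func main() {
	if err := appendEntry("1", "20"); err != nil {
		log.Fatal(err)
	}
	state, err := recoverState()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(state) // map[1:20]
}
```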

Consistency

Example

Suppose you have a replicated key-value storage service (two replicas in total). This service supports two operations:

  • put(key, value)
  • get(key)

Given that this service is replicated, we have two servers, each with its own table:

Server 1 contains the following table:

key   value
1     20

Server 2 contains the following table:

key   value
1     20

The problem arises when a client issues a put operation to update the value of key 1 to 21.

Server 1 table:

key   value
1     21

But before we can update server 2, our system crashes, leaving the two replicas with inconsistent values.
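
To make that failure window concrete, here is a toy sketch (my own construction, not from the lecture) of a put that reaches server 1 but crashes before it reaches server 2:

```go
package main

import "fmt"

// replica is a toy in-memory key-value table.
type replica map[int]int

// put applies an update to each replica in turn; crashAfter simulates
// the process dying partway through propagation.
func put(replicas []replica, key, value, crashAfter int) {
	for i, r := range replicas {
		if i == crashAfter {
			fmt.Println("crash! update never reaches the remaining replicas")
			return
		}
		r[key] = value
	}
}

func main() {
	s1 := replica{1: 20}
	s2 := replica{1: 20}

	// Update key 1 to 21, but crash after writing only the first replica.
	put([]replica{s1, s2}, 1, 21, 1)

	fmt.Println("server 1:", s1) // server 1: map[1:21]
	fmt.Println("server 2:", s2) // server 2: map[1:20] -- out of sync
}
```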

Types of consistency

  • Strong consistency: Using the same key-value storage example as before, strong consistency would mean that any get operation returns the value written by the most recently completed put operation.
  • Weak consistency: It is often useful to accept a weaker consistency guarantee that does not promise the above. The reason is that strong consistency is a VERY expensive spec to implement.
note

There is a lot of academic and real-world research on how to structure weakly consistent systems so that they're actually useful to applications.
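
One simplified way to picture the contract (a toy model of my own, not a real protocol such as Raft or quorum reads): a strongly consistent get must consult something that has seen every completed put, while a weakly consistent get may answer from a possibly stale local copy:

```go
package main

import "fmt"

// A toy model: the primary sees every completed put; the local replica
// may lag behind. This only illustrates the contract, not how real
// systems enforce it.
type store struct {
	primary map[int]int // always up to date in this toy model
	local   map[int]int // possibly stale copy on this server
}

// strongGet routes the read to the primary, so it returns the value of
// the most recently completed put -- at the cost of a remote round trip.
func (s *store) strongGet(key int) int {
	return s.primary[key]
}

// weakGet answers from the local copy: cheap, but possibly stale.
func (s *store) weakGet(key int) int {
	return s.local[key]
}

func main() {
	s := &store{
		primary: map[int]int{1: 21}, // put(1, 21) has completed
		local:   map[int]int{1: 20}, // this replica hasn't heard yet
	}
	fmt.Println("strong:", s.strongGet(1)) // strong: 21
	fmt.Println("weak:  ", s.weakGet(1))   // weak:   20
}
```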


Distributed systems are cool 😎