Introduction
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” (Leslie Lamport)
Why bother building a distributed system?
- Some applications are inherently distributed, like the web
- Some applications require high availability and need to be resilient to single-node failures, like Dropbox
- Some applications need to tackle workloads that are too big to fit on a single node
- Some applications have performance requirements that would be physically impossible to meet with a single node
1.1 Communication
- Nodes must communicate over networks (e.g., browser-server interactions)
- Main challenges include:
- Message representation on the network
- Handling network outages
- Managing data integrity (bit flips)
- Ensuring communication security
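As a minimal sketch of the first two challenges above (message representation and data integrity), the snippet below frames a JSON-encoded message with a length prefix and a CRC32 checksum so the receiver can detect bit flips. The wire format and field names are illustrative assumptions, not a standard protocol.

```python
import json
import struct
import zlib

def encode_message(payload: dict) -> bytes:
    """Frame a message for the wire: 4-byte body length + 4-byte CRC32 + JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return struct.pack("!II", len(body), zlib.crc32(body)) + body

def decode_message(data: bytes) -> dict:
    """Parse a framed message, raising if the checksum does not match (e.g., a bit flip)."""
    length, checksum = struct.unpack("!II", data[:8])
    body = data[8:8 + length]
    if zlib.crc32(body) != checksum:
        raise ValueError("corrupted message: checksum mismatch")
    return json.loads(body.decode("utf-8"))

# A single flipped bit in transit is detected on the receiving side.
frame = encode_message({"op": "get", "key": "user:42"})
corrupted = frame[:10] + bytes([frame[10] ^ 0x01]) + frame[11:]
try:
    decode_message(corrupted)
except ValueError as err:
    print(err)  # corrupted message: checksum mismatch
```

A checksum only detects corruption; the other two challenges (outages and security) would still need timeouts/retries and an encrypted channel such as TLS on top of this.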
1.2 Coordination
- Core Challenge: Coordinating nodes into a coherent system despite failures
- Fault = a component that stops working or misbehaves
- Fault tolerance = the system keeps working despite faults
- “Two Generals” Problem:
- Scenario: Two armies must coordinate attack time
- Communication: Via messengers who may be captured
- Challenge: Cannot achieve absolute certainty of coordination
- Even with multiple messages, 100% certainty impossible
- More messages increase confidence but never guarantee certainty
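To make the last point concrete, here is a small simulation (my own illustration, with an assumed 30% capture probability): sending more messengers raises the chance that at least one gets through to 1 - p^n, which approaches but never reaches certainty.

```python
import random

def at_least_one_delivered(n_messengers: int, p_capture: float, trials: int = 100_000) -> float:
    """Estimate the probability that at least one of n messengers avoids capture."""
    survived = sum(
        1 for _ in range(trials)
        if any(random.random() > p_capture for _ in range(n_messengers))
    )
    return survived / trials

for n in (1, 3, 5, 10):
    estimate = at_least_one_delivered(n, p_capture=0.3)
    print(f"{n:2d} messengers: ~{estimate:.4f} (exact: {1 - 0.3 ** n:.4f})")

# Confidence grows with every extra messenger, but it never reaches 1.0,
# and the sending general still cannot know whether any given messenger arrived.
```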
1.3 Scalability
- Key Performance Metrics:
- Throughput = Operations/second
- Response time = time between a client sending a request and receiving the response
- Load = System-specific measurements like concurrent users, communication links, write/read ratio
- System Capacity:
- Maximum sustainable load before performance degradation
- On the throughput-vs-load graph referenced above, capacity is marked by a dotted line at the point where performance peaks or plateaus
- Goodput = successfully handled requests with acceptable response times (see the worked example after this list)
- Scaling Strategies:
- Scaling Up (Vertical)
- Buy better hardware
- Has physical limitations
- Scaling Out (Horizontal)
- Add more machines
- Main patterns:
- Functional decomposition
- Breaking down a system into smaller services based on their business functionality (e.g., splitting a monolith into auth service, payment service, etc.)
- Duplication
- Creating multiple identical copies of the same component to handle more load in parallel (also known as replication)
- Partitioning
- Dividing data or workload into smaller chunks and distributing them across multiple nodes (e.g., splitting users A-M on one server and N-Z on another; see the sketch after this list)
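As a worked example of the metrics above, the snippet below computes throughput and goodput from a made-up request log, assuming a 300 ms response-time threshold for what counts as "acceptable".

```python
# (timestamp_s, response_time_ms, succeeded) for requests completed in a 2-second window
requests = [
    (0.1, 80, True), (0.4, 120, True), (0.9, 450, True),
    (1.2, 95, True), (1.5, 700, False), (1.8, 110, True),
]

window_s = 2.0
acceptable_ms = 300  # assumed threshold for an acceptable response time

throughput = len(requests) / window_s
goodput = sum(1 for _, rt, ok in requests if ok and rt <= acceptable_ms) / window_s

print(f"throughput: {throughput:.1f} req/s")  # 3.0 req/s
print(f"goodput:    {goodput:.1f} req/s")     # 2.0 req/s (one request failed, one was too slow)
```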
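And here is a minimal sketch of the partitioning pattern, using the A-M / N-Z split mentioned above; the node names and routing functions are assumptions for illustration only.

```python
import zlib

def route_by_range(username: str) -> str:
    """Range partitioning: users starting with A-M go to node-1, N-Z to node-2."""
    return "node-1" if "A" <= username[0].upper() <= "M" else "node-2"

def route_by_hash(username: str, n_nodes: int = 2) -> str:
    """Hash partitioning: spreads keys more evenly, but range queries now hit every node."""
    return f"node-{zlib.crc32(username.encode()) % n_nodes + 1}"

for user in ("alice", "mallory", "nadia", "zoe"):
    print(f"{user}: range -> {route_by_range(user)}, hash -> {route_by_hash(user)}")
```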
1.4 Resiliency
- Resiliency: System’s ability to function despite failures
- Availability: (Uptime / Total Time) x 100%
- Measured in “nines” (e.g. three nines = 99.9%)
- Four or more nines = highly available (see the worked calculation at the end of this section)
- Failure Characteristics:
- Failures are inevitable at scale: every component has a non-zero probability of failing
- More components = more potential failure modes and more failures in absolute terms
- Failures can be interdependent and can cascade
- Engineering Approach
- Must embrace failure as inevitable
- Need paranoid mindset
- Always assume things will fail and proactively plan for every possible failure scenario, no matter how unlikely it may seem
- Similar to Murphy’s Law in engineering - “anything that can go wrong, will go wrong”
- Assess risks by:
- Failure likelihood
- Potential impact
- Mitigation strategies
- Redundancy
- Self-healing mechanisms
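Two back-of-the-envelope calculations to ground the numbers above: the downtime budget implied by each number of nines, and how availability drops when a request depends on many components (assuming independent failures, which real systems often violate).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# Downtime budget implied by each availability level
for label, availability in (("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)):
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label} ({availability:.2%}): ~{downtime_min:.0f} minutes of downtime per year")
# two nines ~5256 min (~3.7 days), three nines ~526 min (~8.8 h), four nines ~53 min

# A request that touches 10 components, each 99.9% available and failing independently,
# only succeeds when all of them are up.
overall = 0.999 ** 10
print(f"10 components at 99.9% each: overall availability ~{overall:.4f}")  # ~0.9900
```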
1.5 Operations
- Evolution of Operations:
- Old: Separate dev and ops teams
- New: DevOps model (same team develops and operates)
- Being on-call helps understand system limitations
- Operational Requirements:
- Continuous safe deployments
- System observability
- Monitoring and alerts for SLO breaches
- Human intervention when needed
- Direct operational responsibility (“you build it, you run it”) leads to better system design and understanding
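As a hedged sketch of what an alert on an SLO breach might check (the 200 ms target and 99% objective are assumptions, not values from the text), this computes the fraction of requests that met the latency target over a window:

```python
def slo_breached(latencies_ms: list[float], target_ms: float = 200.0, objective: float = 0.99) -> bool:
    """Return True if fewer than `objective` of the requests completed within `target_ms`."""
    if not latencies_ms:
        return False
    within_target = sum(1 for latency in latencies_ms if latency <= target_ms) / len(latencies_ms)
    return within_target < objective

# e.g., page the on-call engineer when "99% of requests under 200 ms" is violated
recent_window = [120, 90, 250, 180, 400, 95, 110, 130, 105, 140]
if slo_breached(recent_window):
    print("SLO breach detected: alert the on-call engineer")
```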
1.6 Anatomy of a distributed system
- Architectural Perspectives:
- Physical Layer
- Collection of machines
- Connected via network links
- Runtime Layer
- Software processes
- Communicate via inter-process communication (IPC), e.g., HTTP or HTTPS
- Implementation Layer
- Loosely-coupled services
- Independently deployable/scalable
- Service Architecture
- Core Components
- Business logic
- Interfaces (Inbound/Outbound)
- Adapters
- Interface Types
- Inbound: Defines operations offered to clients
- Outbound: Defines external service communications
- Adapters
- Inbound: Part of API, handles incoming requests
- Outbound: Implements external service access
- A process running a service is referred to as a server
- A process that sends requests to a server is referred to as a client
- The above figure illustrates the architecture of a service in a distributed system. Let's break down its key components:
- Core Components
- Center (Inner circle)
- Business Logic (central component)
- Surrounded by three interfaces (green dots)
- Service Interface
- Messaging Interface
- Repository Interface
- Outer Circle (Blue dots: Adapters)
- HTTP Controller: Handles incoming HTTP requests
- Kafka Producer: Manages message production
- SQL Adapter: Handles database operations
- Flow Structure
- Inbound Flow
- HTTP Controller → Service Interface → Business Logic
- Outbound Flow
- Business Logic → Messaging Interface → Kafka Producer
- Business Logic → Repository Interface → SQL Adapter
- This diagram effectively shows how adapters serve as bridges between the external world (HTTP, Kafka, SQL) and the internal business logic through well-defined interfaces.
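Below is a minimal sketch of this structure in Python; the class and method names (OrderService, place_order, etc.) are illustrative assumptions, and the adapters just print instead of talking to real infrastructure. The point is that the business logic depends only on the interfaces, while the HTTP controller, Kafka producer, and SQL adapter plug in from the outside.

```python
from abc import ABC, abstractmethod

# Interfaces (ports) that the business logic depends on
class RepositoryInterface(ABC):
    @abstractmethod
    def save_order(self, order_id: str, amount: float) -> None: ...

class MessagingInterface(ABC):
    @abstractmethod
    def publish(self, topic: str, event: dict) -> None: ...

# Business logic: knows nothing about HTTP, Kafka, or SQL
class OrderService:
    def __init__(self, repository: RepositoryInterface, messaging: MessagingInterface):
        self.repository = repository
        self.messaging = messaging

    def place_order(self, order_id: str, amount: float) -> None:
        self.repository.save_order(order_id, amount)                          # outbound: repository interface
        self.messaging.publish("orders", {"id": order_id, "amount": amount})  # outbound: messaging interface

# Outbound adapters: implement the interfaces against concrete technologies
class SqlAdapter(RepositoryInterface):
    def save_order(self, order_id: str, amount: float) -> None:
        print(f"INSERT INTO orders VALUES ('{order_id}', {amount})")  # stand-in for a real SQL call

class KafkaProducerAdapter(MessagingInterface):
    def publish(self, topic: str, event: dict) -> None:
        print(f"produce to topic '{topic}': {event}")  # stand-in for a real Kafka producer

# Inbound adapter: translates HTTP requests into calls on the service
class HttpController:
    def __init__(self, service: OrderService):
        self.service = service

    def handle_post_order(self, request_body: dict) -> int:
        self.service.place_order(request_body["id"], request_body["amount"])
        return 201  # HTTP "Created"

controller = HttpController(OrderService(SqlAdapter(), KafkaProducerAdapter()))
print(controller.handle_post_order({"id": "order-1", "amount": 9.99}))
```

Swapping SQL for another database or Kafka for a different broker only requires a new outbound adapter; the business logic and its interfaces stay untouched.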