Introduction
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” (Leslie Lamport)
Why bother building a distributed system?
- Some applications are inherently distributed, like the web
- Some applications require high availability and need to be resilient to single-node failures, like Dropbox
- Some applications need to tackle workloads that are too big to fit on a single node
- Some applications have performance requirements that would be physically impossible to meet with a single node
1.1 Communication
- Nodes must communicate over networks (e.g., browser-server interactions)
- Main challenges include:
- Message representation on the network
- Handling network outages
- Managing data integrity (bit flips)
- Ensuring communication security
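As a minimal sketch of the first two challenges above (message representation and data integrity), the snippet below frames a JSON-encoded message with a length prefix and a CRC32 checksum so the receiver can detect bit flips. The wire format and field names are illustrative assumptions, not a standard protocol.

```python
import json
import struct
import zlib

def encode_message(payload: dict) -> bytes:
    """Frame a message for the wire: 4-byte body length + 4-byte CRC32 + JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return struct.pack("!II", len(body), zlib.crc32(body)) + body

def decode_message(data: bytes) -> dict:
    """Parse a framed message, raising if the checksum does not match (e.g., a bit flip)."""
    length, checksum = struct.unpack("!II", data[:8])
    body = data[8:8 + length]
    if zlib.crc32(body) != checksum:
        raise ValueError("corrupted message: checksum mismatch")
    return json.loads(body.decode("utf-8"))

# A single flipped bit in transit is detected on the receiving side.
frame = encode_message({"op": "get", "key": "user:42"})
corrupted = frame[:10] + bytes([frame[10] ^ 0x01]) + frame[11:]
try:
    decode_message(corrupted)
except ValueError as err:
    print(err)  # corrupted message: checksum mismatch
```

A checksum only detects corruption; the other two challenges (outages and security) would still need timeouts/retries and an encrypted channel such as TLS on top of this.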
1.2 Coordination
- Core Challenge: Coordinating nodes into a coherent system despite failures
- Fault = a component that stops working or misbehaves
- Fault tolerance = the system keeps working despite faults
- “Two Generals” Problem:
- Scenario: Two armies must coordinate attack time
- Communication: Via messengers who may be captured
- Challenge: Cannot achieve absolute certainty of coordination
- Even with multiple messages, 100% certainty impossible
- More messages increase confidence but never guarantee certainty
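To make the last point concrete, here is a small simulation (my own illustration, with an assumed 30% capture probability): sending more messengers raises the chance that at least one gets through to 1 - p^n, which approaches but never reaches certainty.

```python
import random

def at_least_one_delivered(n_messengers: int, p_capture: float, trials: int = 100_000) -> float:
    """Estimate the probability that at least one of n messengers avoids capture."""
    survived = sum(
        1 for _ in range(trials)
        if any(random.random() > p_capture for _ in range(n_messengers))
    )
    return survived / trials

for n in (1, 3, 5, 10):
    estimate = at_least_one_delivered(n, p_capture=0.3)
    print(f"{n:2d} messengers: ~{estimate:.4f} (exact: {1 - 0.3 ** n:.4f})")

# Confidence grows with every extra messenger, but it never reaches 1.0,
# and the sending general still cannot know whether any given messenger arrived.
```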
1.3 Scalability
- Key Performance Metrics:
- Throughput = Operations/second
- Response time = time between a client sending a request and receiving the response
- Load = System-specific measurements like concurrent users, communication links, write/read ratio
- System Capacity:
- Maximum sustainable load before performance degradation
- On the throughput-vs-load graph referenced above, capacity is marked by a dotted line at the point where performance peaks or plateaus
- Goodput = successfully handled requests with acceptable response times (see the worked example after this list)
- Scaling Strategies:
- Scaling Up (Vertical)
- Buy better hardware
- Has physical limitations
- Scaling Out (Horizontal)
- Add more machines
- Main patterns:
- Functional decomposition
- Breaking down a system into smaller services based on their business functionality (e.g., splitting a monolith into auth service, payment service, etc.)
- Duplication
- Creating multiple identical copies of the same component to handle more load in parallel (also known as replication)
- Partitioning
- Dividing data or workload into smaller chunks and distributing them across multiple nodes (e.g., splitting users A-M on one server and N-Z on another; see the sketch after this list)
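As a worked example of the metrics above, the snippet below computes throughput and goodput from a made-up request log, assuming a 300 ms response-time threshold for what counts as "acceptable".

```python
# (timestamp_s, response_time_ms, succeeded) for requests completed in a 2-second window
requests = [
    (0.1, 80, True), (0.4, 120, True), (0.9, 450, True),
    (1.2, 95, True), (1.5, 700, False), (1.8, 110, True),
]

window_s = 2.0
acceptable_ms = 300  # assumed threshold for an acceptable response time

throughput = len(requests) / window_s
goodput = sum(1 for _, rt, ok in requests if ok and rt <= acceptable_ms) / window_s

print(f"throughput: {throughput:.1f} req/s")  # 3.0 req/s
print(f"goodput:    {goodput:.1f} req/s")     # 2.0 req/s (one request failed, one was too slow)
```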
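And here is a minimal sketch of the partitioning pattern, using the A-M / N-Z split mentioned above; the node names and routing functions are assumptions for illustration only.

```python
import zlib

def route_by_range(username: str) -> str:
    """Range partitioning: users starting with A-M go to node-1, N-Z to node-2."""
    return "node-1" if "A" <= username[0].upper() <= "M" else "node-2"

def route_by_hash(username: str, n_nodes: int = 2) -> str:
    """Hash partitioning: spreads keys more evenly, but range queries now hit every node."""
    return f"node-{zlib.crc32(username.encode()) % n_nodes + 1}"

for user in ("alice", "mallory", "nadia", "zoe"):
    print(f"{user}: range -> {route_by_range(user)}, hash -> {route_by_hash(user)}")
```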
1.4 Resiliency
- Resiliency: System’s ability to function despite failures
- Availability: (Uptime / Total Time) x 100%
- Measured in “nines” (e.g. three nines = 99.9%)
- Four or more nines = highly available (see the worked calculation at the end of this section)
- Failure Characteristics:
- Failures are inevitable at scale: every component has a non-zero probability of failing
- More components = more potential failure modes and more failures in absolute terms
- Failures can be interdependent and can cascade
- Engineering Approach
- Must embrace failure as inevitable
- Need paranoid mindset
- Always assume things will fail and proactively plan for every possible failure scenario, no matter how unlikely it may seem
- Similar to Murphy’s Law in engineering - “anything that can go wrong, will go wrong”
- Assess risks by:
- Failure likelihood
- Potential impact
- Mitigation strategies
- Redundancy
- Self-healing mechanisms
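Two back-of-the-envelope calculations to ground the numbers above: the downtime budget implied by each number of nines, and how availability drops when a request depends on many components (assuming independent failures, which real systems often violate).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# Downtime budget implied by each availability level
for label, availability in (("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)):
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label} ({availability:.2%}): ~{downtime_min:.0f} minutes of downtime per year")
# two nines ~5256 min (~3.7 days), three nines ~526 min (~8.8 h), four nines ~53 min

# A request that touches 10 components, each 99.9% available and failing independently,
# only succeeds when all of them are up.
overall = 0.999 ** 10
print(f"10 components at 99.9% each: overall availability ~{overall:.4f}")  # ~0.9900
```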
1.5 Operations
- Evolution of Operations:
- Old: Separate dev and ops teams
- New: DevOps model (same team develops and operates)
- Being on-call helps understand system limitations
- Operational Requirements:
- Continuous safe deployments
- System observability
- Monitoring and alerts for SLO breaches
- Human intervention when needed
- Direct operational responsibility (“you build it, you run it”) leads to better system design and understanding
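As a hedged sketch of what an alert on an SLO breach might check (the 200 ms target and 99% objective are assumptions, not values from the text), this computes the fraction of requests that met the latency target over a window:

```python
def slo_breached(latencies_ms: list[float], target_ms: float = 200.0, objective: float = 0.99) -> bool:
    """Return True if fewer than `objective` of the requests completed within `target_ms`."""
    if not latencies_ms:
        return False
    within_target = sum(1 for latency in latencies_ms if latency <= target_ms) / len(latencies_ms)
    return within_target < objective

# e.g., page the on-call engineer when "99% of requests under 200 ms" is violated
recent_window = [120, 90, 250, 180, 400, 95, 110, 130, 105, 140]
if slo_breached(recent_window):
    print("SLO breach detected: alert the on-call engineer")
```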
1.6 Anatomy of a distributed system
- Architectural Perspectives:
- Physical Layer
- Collection of machines
- Connected via network links
- Runtime Layer
- Software processes
- Communicate via inter-process communication (IPC), e.g., HTTP or HTTPS
- Implementation Layer
- Loosely-coupled services
- Independently deployable/scalable
- Service Architecture
- Core Components
- Business logic
- Interfaces (Inbound/Outbound)
- Adapters
- Interface Types
- Inbound: Defines operations offered to clients
- Outbound: Defines external service communications
- Adapters
- Inbound: Part of API, handles incoming requests
- Outbound: Implements external service access
- A process running a service is referred to as a server
- A process that sends requests to a server is referred to as a client
- The above figure illustrates the architecture of a service in a distributed system. Let's break down its key components:
- Core Components
- Center (Inner circle)
- Business Logic (central component)
- Surrounded by three interfaces (green dots)
- Service Interface
- Messaging Interface
- Repository Interface
- Outer Circle (Blue dots: Adapters)
- HTTP Controller: Handles incoming HTTP requests
- Kafka Producer: Manages message production
- SQL Adapter: Handles database operations
- Flow Structure
- Inbound Flow
- HTTP Controller → Service Interface → Business Logic
- Outbound Flow
- Business Logic → Messaging Interface → Kafka Producer
- Business Logic → Repository Interface → SQL Adapter
- This diagram effectively shows how adapters serve as bridges between the external world (HTTP, Kafka, SQL) and the internal business logic through well-defined interfaces.
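Below is a minimal sketch of this structure in Python; the class and method names (OrderService, place_order, etc.) are illustrative assumptions, and the adapters just print instead of talking to real infrastructure. The point is that the business logic depends only on the interfaces, while the HTTP controller, Kafka producer, and SQL adapter plug in from the outside.

```python
from abc import ABC, abstractmethod

# Interfaces (ports) that the business logic depends on
class RepositoryInterface(ABC):
    @abstractmethod
    def save_order(self, order_id: str, amount: float) -> None: ...

class MessagingInterface(ABC):
    @abstractmethod
    def publish(self, topic: str, event: dict) -> None: ...

# Business logic: knows nothing about HTTP, Kafka, or SQL
class OrderService:
    def __init__(self, repository: RepositoryInterface, messaging: MessagingInterface):
        self.repository = repository
        self.messaging = messaging

    def place_order(self, order_id: str, amount: float) -> None:
        self.repository.save_order(order_id, amount)                          # outbound: repository interface
        self.messaging.publish("orders", {"id": order_id, "amount": amount})  # outbound: messaging interface

# Outbound adapters: implement the interfaces against concrete technologies
class SqlAdapter(RepositoryInterface):
    def save_order(self, order_id: str, amount: float) -> None:
        print(f"INSERT INTO orders VALUES ('{order_id}', {amount})")  # stand-in for a real SQL call

class KafkaProducerAdapter(MessagingInterface):
    def publish(self, topic: str, event: dict) -> None:
        print(f"produce to topic '{topic}': {event}")  # stand-in for a real Kafka producer

# Inbound adapter: translates HTTP requests into calls on the service
class HttpController:
    def __init__(self, service: OrderService):
        self.service = service

    def handle_post_order(self, request_body: dict) -> int:
        self.service.place_order(request_body["id"], request_body["amount"])
        return 201  # HTTP "Created"

controller = HttpController(OrderService(SqlAdapter(), KafkaProducerAdapter()))
print(controller.handle_post_order({"id": "order-1", "amount": 9.99}))
```

Swapping SQL for another database or Kafka for a different broker only requires a new outbound adapter; the business logic and its interfaces stay untouched.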