System Models
This chapter introduces system models: explicit assumptions about how nodes, communication links, and timing behave in a distributed system. A system model abstracts away much of the complexity of a real deployment, which makes it possible to reason about algorithms and state precisely what they require. The sources provide examples of models for communication links, node behavior, and timing:
Communication Link Models
- Fair-loss link model: Assumes messages can be lost, duplicated, or reordered, but if the sender keeps retransmitting, the message will eventually reach the destination. This model reflects the inherent unreliability of real-world networks.
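- Stubborn link model: Builds on the fair-loss model by having the sender retransmit every message indefinitely. A message sent to a node that has not crashed is therefore eventually delivered, possibly many times over.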
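- Perfect link model: Assumes every message sent is delivered exactly once: none is lost, duplicated, or invented by the link. In the usual formulation, ordering is not part of this guarantee and is treated as a separate property (for example, FIFO delivery). Although raw networks do not behave this way, the model is useful for understanding basic concepts, and it can be approximated on top of fair-loss links by retransmitting messages and discarding duplicates.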
Even reliable communication protocols like TCP, which deliver data in order without loss or duplication for as long as a connection lasts, are built on top of unreliable lower-level protocols like IP.
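The layering described above, a stronger link built on a weaker one, can be sketched with retransmission plus deduplication. This is a minimal simulation and is not taken from the sources: `FairLossLink`, `Receiver`, and `send_reliably` are hypothetical names, the link is an in-memory stand-in with arbitrary drop and duplication probabilities, and acknowledgements are assumed to arrive reliably, which a real network would not guarantee.

```python
import random


class FairLossLink:
    """Hypothetical in-memory link: each send may be dropped or duplicated."""

    def __init__(self, seed, drop_prob=0.5, dup_prob=0.3):
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.drop_prob = drop_prob
        self.dup_prob = dup_prob

    def send(self, message):
        """Return the copies that arrive: zero, one, or two."""
        if self.rng.random() < self.drop_prob:
            return []                      # message lost
        if self.rng.random() < self.dup_prob:
            return [message, message]      # message duplicated
        return [message]


class Receiver:
    """Delivers each message ID to the application at most once."""

    def __init__(self):
        self.seen_ids = set()
        self.delivered = []

    def on_receive(self, message):
        msg_id, payload = message
        if msg_id not in self.seen_ids:    # discard duplicates
            self.seen_ids.add(msg_id)
            self.delivered.append(payload)


def send_reliably(link, receiver, msg_id, payload, max_attempts=1000):
    """Retransmit until at least one copy arrives; return attempts used.

    Assumption: the sender learns of the arrival immediately, i.e. the
    acknowledgement path is treated as reliable for simplicity.
    """
    for attempt in range(1, max_attempts + 1):
        copies = link.send((msg_id, payload))
        for copy in copies:
            receiver.on_receive(copy)
        if copies:
            return attempt
    raise RuntimeError("gave up: message was never delivered")


link = FairLossLink(seed=1)
receiver = Receiver()
for msg_id, payload in enumerate(["a", "b", "c"]):
    send_reliably(link, receiver, msg_id, payload)
print(receiver.delivered)
```

Retransmission handles loss, and the set of seen IDs handles duplication, including duplicates caused by the retransmissions themselves. The `max_attempts` cap is a practical concession: the fair-loss model only promises eventual delivery, not delivery within a fixed number of tries.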
Node Failure Models
- Crash-stop fault model: Assumes nodes can only fail by crashing, and that a crashed node stops executing permanently and never rejoins the system.
- Crash-recovery fault model: Assumes nodes can crash at any moment and may resume executing later. A crash loses in-memory state; only data written to stable storage (such as disk) survives. This model acknowledges that nodes can be restarted or recover from failures.
- Arbitrary-fault model (Byzantine model): Assumes nodes can deviate from their intended behavior in unpredictable ways, including crashes, unexpected behavior due to bugs, or even malicious activity. This is the most challenging model to design for, because a faulty node may keep running and send incorrect or contradictory messages.
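The difference between the two crash models can be made concrete with a toy node. This is a minimal sketch under stated assumptions, not an implementation from the sources: the `Node` class and its methods are hypothetical, and a plain dictionary stands in for stable storage such as a disk.

```python
class Node:
    """Toy node with separate volatile and stable state (hypothetical)."""

    def __init__(self):
        self.stable = {}     # stands in for disk: survives a crash
        self.volatile = {}   # in-memory state: lost on a crash
        self.alive = True

    def write(self, key, value, durable):
        if not self.alive:
            raise RuntimeError("node is crashed")
        target = self.stable if durable else self.volatile
        target[key] = value

    def read(self, key):
        if not self.alive:
            raise RuntimeError("node is crashed")
        if key in self.volatile:
            return self.volatile[key]
        return self.stable.get(key)

    def crash(self):
        self.alive = False
        self.volatile = {}   # everything held only in memory is gone

    def recover(self):
        # Only possible under crash-recovery; under crash-stop this
        # method would never be called after crash().
        self.alive = True


node = Node()
node.write("term", 3, durable=True)
node.write("cache", "x", durable=False)
node.crash()
node.recover()
print(node.read("term"), node.read("cache"))
```

Under the crash-stop model the story ends at `crash()`. Under crash-recovery, an algorithm has to decide which state it writes durably before acting on it, since anything else may be missing after a restart.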
Timing Models
- Synchronous system model: Assumes strict bounds on message delays and processing times.
- Asynchronous system model: Makes no assumptions about timing, allowing for unbounded message delays and processing times.
- Partially synchronous system model: Falls between the synchronous and asynchronous models. The system is assumed to behave synchronously most of the time, but it may become asynchronous for finite periods of unknown length.
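The timing model matters most when a node uses timeouts to guess whether another node has crashed. The sketch below shows one common way to cope with partial synchrony: treat a timeout as a suspicion rather than a fact, and lengthen the timeout whenever a suspicion turns out to be wrong. The names `AdaptiveTimeoutDetector` and `observe`, and the doubling policy, are illustrative assumptions and do not come from the sources.

```python
class AdaptiveTimeoutDetector:
    """Timeout-based failure detector that backs off after false suspicions."""

    def __init__(self, initial_timeout=1.0):
        self.timeout = initial_timeout
        self.suspected = False

    def timeout_expired(self):
        """Called when no reply arrived within self.timeout."""
        self.suspected = True

    def reply_received(self):
        """Called when a reply arrives, possibly after the timeout expired."""
        if self.suspected:
            # The node was slow, not crashed: withdraw the suspicion
            # and wait longer next time.
            self.suspected = False
            self.timeout *= 2


def observe(detector, reply_delay):
    """Feed one request/reply round into the detector.

    reply_delay is the time the reply took, or None if no reply ever comes.
    Returns whether the remote node is currently suspected.
    """
    if reply_delay is None or reply_delay > detector.timeout:
        detector.timeout_expired()
    if reply_delay is not None:
        detector.reply_received()
    return detector.suspected


detector = AdaptiveTimeoutDetector(initial_timeout=1.0)
for delay in [0.5, 3.0, 3.0, 3.0, None]:
    suspected = observe(detector, delay)
    print(delay, detector.timeout, suspected)
```

In a synchronous model the timeout could simply be set to the known bound and every suspicion would be correct. In a fully asynchronous model no finite timeout is safe. Under partial synchrony, a detector like this can be wrong during asynchronous periods, but its timeout eventually grows past the real delay once the system behaves synchronously again.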
Assumptions in the Sources
The sources primarily operate under a system model characterized by:
- Fair-loss links: A realistic model for networks that may drop or duplicate messages but are not under the control of an active adversary.
- Crash-recovery node behavior: Nodes are assumed to be able to restart after a crash, keeping only what they wrote to stable storage. This matches how most deployed systems are operated.
- Partial synchrony: The timing model allows for periods of bounded, predictable delays as well as periods in which delays are unbounded.
Model Limitations
While models are helpful for simplifying the complexities of distributed systems, they are still abstractions of reality, and abstractions can sometimes leak. As a reader, it's essential to:
- Question the models’ assumptions: Consider how the algorithms presented in the sources might break down if the underlying assumptions are violated.
- Recognize the limitations of models: Be aware that real-world systems can exhibit behaviors not fully captured by any model.
Knowing both what a system model guarantees and where it stops matching reality helps readers judge when an algorithm's correctness argument still applies to the system they are actually building.