Introduction to Unreliable Networks
Distributed systems rely heavily on networks for node-to-node communication. However, networks are inherently untrustworthy—messages may be delayed, lost, duplicated, or even delivered out of order. Designing reliable distributed systems requires acknowledging and overcoming these inconsistencies, especially in asynchronous environments where guarantees around timing and delivery don’t exist.
Common Network Faults in Practice
Computer networks have been a focus of optimization for decades, yet they remain fundamentally unreliable. Key problems include:
- Packet Loss: Data packets may be dropped due to congestion, misconfiguration, or physical link failures.
- Delayed Responses: Network queues and overloaded links can dramatically increase response times.
- Node Crashes: A requesting or responding node might fail while in the middle of an exchange.
- Network Partitions: Certain parts of the network become isolated entirely, unable to communicate with the rest of the system.
Modern datacenters experience network disruptions regularly. A medium-sized datacenter may report a dozen faults each month, with these faults impacting anything from an individual machine to an entire rack.
Handling Unreliable Networks with Protocols
Distributed systems handle unreliable networks by employing multiple layers of error-handling mechanisms.
Error Correction and Detection
Techniques like error-correcting codes enable networks to recover from minor physical-layer faults. For example, data transmitted over wireless networks with minor interference is corrected using parity bits and other mechanisms.
Protocols for Network Reliability
- TCP (Transmission Control Protocol): TCP handles packet retransmissions, duplicate elimination, and reordering, offering a reliable transport layer over an unreliable Internet Protocol (IP). While it solves many problems at the network level, it cannot address high-level concerns like application timeouts or unbounded delays.
- UDP (User Datagram Protocol): For latency-sensitive applications like video streaming, UDP avoids retransmissions, sacrificing reliability for speed.
Challenges of Asynchronous Networks
In asynchronous network environments, the sender rarely knows the fate of its message:
- Was the request lost?
- Did the recipient crash, or is it just overloaded?
- Was the reply message delayed or dropped?
A sender waiting indefinitely for a response might find nothing but silence, rendering fault diagnosis nearly impossible.
Timeouts: A Practical Workaround
Timeouts allow systems to abort ill-fated attempts and retry. However, timeout-based detection is itself imperfect:
- A request might still succeed even when the sender gives up on waiting.
- Premature timeouts could misclassify a slow node as a failed one, triggering unnecessary recovery mechanisms that amplify load imbalance.
Adjusting timeout duration requires careful experimentation, factoring in network variability and application needs.
Network Congestion and Queueing
Queueing delays occur when multiple nodes attempt to route traffic through a bottlenecked link simultaneously:
- Packets build up in queues at network switches, waiting their turn for transmission. In extreme cases, queues fill up entirely, resulting in packet drops.
- At the receiving end, incoming requests are queued until the application can process them. Overloaded servers amplify delays, further creating a vicious cycle.
Effective congestion avoidance strategies include flow control protocols (like TCP’s backpressure mechanism) and resource monitoring systems that evenly distribute workloads.
Building a Reliable System from Unreliable Networks
Engineers can build robust systems atop unreliable networks using the following approaches:
- Retry Logic with Idempotency
- Systems must gracefully handle repeated requests, ensuring the same operation is not performed multiple times during retries.
- Quorum Systems
- Distributed databases often use quorum writes and reads, where operations succeed if completed by a majority of replicas, negating dependency on any single node’s response.
- Monitoring and Failure Simulations
- Regular stress tests and tools like Chaos Monkey mimic network faults to uncover system weaknesses, ensuring adequate recovery logic is in place.
Conclusion
Unreliable networks are a reality in distributed systems, but by leveraging well-designed protocols and thoughtful error-handling mechanisms, engineers can achieve resilience and reliability. While perfect reliability remains elusive, robust designs ensure networks perform seamlessly in high-demand and failure-prone environments. Handling unpredictability is a cornerstone of successful distributed systems.