Faults and Partial Failures in Distributed Systems


Introduction to Faults and Failures

Distributed systems, by their nature, span multiple computers communicating over unreliable networks. Partial failures represent an inherent challenge: some components of the system may fail unpredictably while others continue to function. Unlike a single-computer system, where a serious fault typically brings the whole machine down in a predictable, all-or-nothing way, distributed systems operate amid a continuous mess of partial faults.

Faults are especially hard to handle in such systems because they are often nondeterministic: failures and delays do not behave consistently from one attempt to the next, making symptoms hard to diagnose and programs hard to reason about.


What are Partial Failures?

In a distributed environment, partial failures occur when some nodes or message paths fail while others function correctly. For instance:

  • A subset of servers in a data center might go offline due to network issues, even though others remain operational.
  • Failure of one power distribution unit (PDU) may leave part of a system functional while another part crashes entirely.

Real-World Anecdote

A hypoglycemic driver crashing into a data center’s HVAC system once left entire racks offline while others remained fine—a clear example of distributed systems’ unpredictable failure modes.


Why Partial Failures Are Hard

  1. Nondeterministic Network Behavior
    • Messages and acknowledgments may be delayed, lost, or duplicated unpredictably, leading to ambiguous states where you can’t tell whether a request succeeded or failed (see the sketch after this list).
  2. Dependency on Multiple Nodes
    • An operation that spans multiple machines may succeed on some and fail on others, leaving the system in an inconsistent intermediate state.
  3. System Recovery Can Be Ambiguous
    • During partitions or outages, systems may need to make decisions based on incomplete information, raising additional challenges like determining whether a failure requires retries, rollbacks, or leadership changes.
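
To make the ambiguity in point 1 concrete, here is a minimal Python sketch of a request with a timeout (the connection setup, payload, and function name are illustrative, not a specific library's API). The important part is the except branch: when the timeout fires, the client genuinely cannot tell what happened.

```python
import socket

def send_request(conn: socket.socket, request: bytes) -> str:
    """Send a request and wait for an acknowledgment.

    On timeout the outcome is unknown: the request may have been lost,
    the server may have crashed mid-operation, or only the
    acknowledgment may have been dropped on the way back.
    """
    conn.settimeout(5.0)       # give up waiting after 5 seconds
    conn.sendall(request)      # may or may not reach the server
    try:
        conn.recv(1024)        # block until an acknowledgment arrives
        return "succeeded"     # the server processed it and replied
    except socket.timeout:
        # Retrying blindly here risks applying the operation twice,
        # so real systems pair retries with idempotency or dedup keys.
        return "unknown"
```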

Cloud Computing vs. Supercomputing: Contrasting Philosophies

Supercomputers

  • These systems often follow a stop-and-restart model:
    • Jobs periodically checkpoint their state to durable storage (a minimal sketch of this pattern follows this list).
    • The entire system halts when faults occur, and recovery begins from the last consistent state after fixing the fault.
    • This prioritizes correctness over availability.
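
Here is a minimal Python sketch of the checkpoint-and-restart pattern (the file name, step counter, and compute_one_step stand-in are all illustrative): on restart, the job resumes from the last durable snapshot instead of starting over.

```python
import os
import pickle

CHECKPOINT = "job_state.pkl"   # hypothetical checkpoint location

def compute_one_step(step: int) -> int:
    return step                # stand-in for the real computation

def run_job(total_steps: int) -> int:
    # Resume from the last durable checkpoint if one exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "result": 0}

    while state["step"] < total_steps:
        state["result"] += compute_one_step(state["step"])
        state["step"] += 1
        if state["step"] % 1000 == 0:
            # Write to a temp file, then atomically swap it in, so a
            # crash mid-write never corrupts the last good checkpoint.
            with open(CHECKPOINT + ".tmp", "wb") as f:
                pickle.dump(state, f)
            os.replace(CHECKPOINT + ".tmp", CHECKPOINT)
    return state["result"]
```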

Cloud Computing

  • In contrast, cloud systems prioritize availability:
    • Fault tolerance must be built into the software because failures are treated as inevitable. Cloud services need to remain operational even during node failures, network interruptions, or rolling upgrades (one basic building block, retrying transient failures, is sketched after this list).
    • Cloud components are often commodity hardware, making failures more frequent, but the software is designed to handle such imperfections resiliently.
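
As one example of baking fault tolerance into software, here is a sketch of retrying a transient failure with exponential backoff and jitter (the function name and the exception types it catches are assumptions for illustration). Note that this is only safe for idempotent operations, for exactly the ambiguity discussed earlier.

```python
import random
import time

def call_with_retries(operation, max_attempts: int = 5):
    """Retry a flaky operation with exponential backoff plus jitter,
    a common idiom for riding out transient node or network faults.
    Only safe when the operation is idempotent, since a timed-out
    attempt may have succeeded without us knowing."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise                               # out of attempts
            delay = 2 ** attempt + random.random()  # seconds, with jitter
            time.sleep(delay)                       # avoids thundering herds
```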

Supercomputers’ rigid, stop-the-world fault handling contrasts with the adaptive, fault-tolerant strategies cloud environments require, underscoring the central challenge: building reliable systems from inherently unreliable components.


Building Resilient Distributed Systems

  1. Design for Fault Tolerance
    • Components with built-in fault-tolerance mechanisms, such as quorum protocols and replication, can prevent a single failure from cascading into a system-wide outage (see the quorum sketch after this list).
    • For example, rolling upgrades allow individual nodes to undergo maintenance without interrupting service.
  2. Accept Continuous Failures
    • With thousands of nodes, the assumption is that something is always broken. Engineers often adopt tools like Chaos Monkey to induce controlled failures and ensure systems behave predictably under unexpected stress.
  3. Make Use of Commodity Hardware
    • In cloud environments, replacing failed nodes is quicker and cheaper than guaranteeing component reliability. Systems should anticipate failures and recover automatically, either by reallocating workloads or spinning up new instances.

    Example: If a virtual machine underperforms, operators terminate it and start a new instance without affecting ongoing operations.
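
To illustrate the quorum idea from point 1, here is a minimal Python sketch (the replica objects and their write(key, value) method are hypothetical): a write counts as committed once a strict majority of replicas acknowledges it, so a minority of failed or slow nodes can neither block nor undermine it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def quorum_write(replicas, key, value) -> bool:
    """Send a write to every replica and commit once a strict
    majority (a quorum) has acknowledged it."""
    quorum = len(replicas) // 2 + 1
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.write, key, value) for r in replicas]
    acks = 0
    for future in as_completed(futures):
        try:
            future.result()    # raises if that replica's write failed
            acks += 1
        except Exception:
            pass               # tolerate individual replica failures
        if acks >= quorum:
            pool.shutdown(wait=False)  # don't wait for stragglers
            return True        # majority reached: the write is committed
    pool.shutdown(wait=False)
    return False               # too many failures: reject the write
```

A production version would also bound the total wait with a timeout, since a replica that hangs forever would otherwise stall the loop once a quorum is no longer reachable.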


Conclusion

Partial failures are a defining characteristic of distributed systems, and addressing them requires designing with pessimism and resilience in mind. Unlike single-computer environments, where software can largely assume deterministic, reliable hardware, distributed systems engineers must embrace uncertainty to handle everything from hardware breakdowns to network glitches.

By adopting fault-tolerant principles, dynamic recovery processes, and thorough testing strategies, distributed systems can continue to perform reliably—even when certain parts fail unpredictably. This adaptability lies at the heart of modern cloud-based and internet-scale systems.
