Redundancy
Reliability engineering & safety science (von Neumann, Shannon; modern SRE and business continuity practice)

Redundancy means providing more than one way to achieve a required function. In a series system any single failure fails the whole; redundancy puts alternatives in parallel, so other components, suppliers or people can take over. It differs from buffers (spare time or stock) and works best when backups are independent and regularly exercised, so that they work under stress.
Patterns
- Active–active (parallel) – multiple units serve at once; one can disappear with no outage.
- Active–passive (standby) – secondary takes over on failure; classify as hot/warm/cold by readiness.
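- 2N/N+1/quorum – full duplication (2N), one spare unit beyond the N required (N+1), or majority voting (quorum, as in consensus protocols). RAID applies the duplication and spare-unit ideas to disks through mirroring and parity.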
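The sizing rules behind these patterns can be sketched in a few lines. This is a minimal illustration, not a capacity-planning tool: the function names are hypothetical, and the quorum rule assumes a simple strict-majority vote with equal-weight members.

```python
def units_n_plus_1(n_required: int) -> int:
    """N+1: the N units needed to carry the load, plus one spare."""
    return n_required + 1


def units_2n(n_required: int) -> int:
    """2N: a complete duplicate of the N required units."""
    return 2 * n_required


def quorum_size(members: int) -> int:
    """Smallest strict majority of a voting group (assumes equal votes)."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can be lost while a strict majority can still form."""
    return members - quorum_size(members)
```

Under these assumptions a 3-member group and a 4-member group both tolerate one failure, which is why quorum groups are usually sized with an odd number of members. Because two disjoint groups cannot both hold a strict majority, the same rule also limits the risk of two leaders acting at once.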
Independence & diversity – spread across vendors/regions/power/failure modes; add design diversity to avoid common-mode failure.
Reliability math (intuition) – in series, component reliabilities multiply, so any one failure takes the system down; in parallel, the system succeeds if any path works (assuming the paths fail independently).
Graceful degradation – shed non-essential features and load to keep the core available.
People & process – cross-training, runbooks, and documentation raise the bus factor.
Data & backups – separate copies, media and locations; verify with restore tests.
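The series-versus-parallel intuition above can be made concrete with a short calculation. This is a sketch under one strong assumption, stated in the comments: component failures are statistically independent. The function names are illustrative.

```python
from math import prod


def series_reliability(reliabilities):
    """Every component must work, so the individual reliabilities multiply."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """At least one path must work: 1 minus the chance that all paths fail.

    Assumes the paths fail independently of one another.
    """
    return 1 - prod(1 - r for r in reliabilities)


# Three 99%-reliable components in series versus two in parallel.
chain = series_reliability([0.99, 0.99, 0.99])   # about 0.970
pair = parallel_reliability([0.99, 0.99])        # about 0.9999
```

With these example numbers, chaining three components lowers reliability to roughly 97%, while duplicating one raises it to roughly 99.99%. The gain holds only to the extent that the independence assumption does.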
SRE/IT – multi-AZ/region, load balancers, database replicas, circuit breakers.
Supply chain – dual-source critical inputs; safety stock at bottlenecks.
Operations – spare capacity, alternate routes, manual fallbacks.
Finance – liquidity buffers, diversified facilities, ring-fenced risk.
Org design – deputy roles, rota coverage, shared ownership of key knowledge.
Map the function and SPOFs – draw the value stream; mark single points of failure (tech, vendor, person, licence, site).
Choose a pattern per SPOF – N+1 for components, 2N for safety-critical functions, quorum for consensus systems, graceful degradation for peak load.
Ensure independence – separate clouds/regions/power feeds; vendor and design diversity where failure modes could correlate.
Instrument detection & switchover – health checks, timeouts, automated failover with manual override.
Drill it – scheduled game days and restore tests; rotate duties so backups stay warm.
Keep parity – config and data sync for standbys; prevent drift with automation.
Set service levels – target availability/MTTR; place redundancy where the impact or irreversibility is highest.
Review cost vs risk – model expected loss vs capex/opex; keep redundancy where it buys meaningful risk reduction.
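The detection-and-switchover step above (health checks, automated failover, manual override) can be sketched as a small selector. This is a hypothetical illustration rather than a production design: the class name, the `is_healthy` callback and the consecutive-failure threshold are all assumptions, and the callback is expected to enforce its own timeout.

```python
from typing import Callable, Optional, Sequence


class FailoverSelector:
    """Choose the highest-priority healthy backend, honouring a manual override."""

    def __init__(
        self,
        backends: Sequence[str],
        is_healthy: Callable[[str], bool],
        failures_before_failover: int = 3,
    ) -> None:
        self.backends = list(backends)          # priority order: primary first
        self.is_healthy = is_healthy            # assumed to apply its own timeout
        self.threshold = failures_before_failover
        self.failures = {b: 0 for b in self.backends}
        self.override: Optional[str] = None     # backend pinned by an operator

    def probe(self) -> None:
        """Run one health-check round and update consecutive-failure counts."""
        for backend in self.backends:
            if self.is_healthy(backend):
                self.failures[backend] = 0
            else:
                self.failures[backend] += 1

    def active(self) -> Optional[str]:
        """Return the backend that should serve, or None if none qualifies."""
        if self.override is not None:
            return self.override
        for backend in self.backends:
            if self.failures[backend] < self.threshold:
                return backend
        return None
```

Requiring several consecutive failures before switching avoids flapping on a single missed check. This sketch fails back to the primary as soon as one check passes again; a real system may prefer a manual or delayed failback.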
Common-mode failure – “redundant” paths sharing a region, provider, library or process.
Bit-rot – cold backups decay unnoticed; no one practises the switchover.
Split-brain & inconsistency – unsynchronised replicas; design clear leadership/quorum rules.
Complexity tax – more parts mean more failure modes; keep designs simple and observable.
False comfort – redundancy without detection, automation, or runbooks.
Security surface – extra endpoints and creds expand attack surface; pair with controls.
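The first pitfall, common-mode failure, can be quantified with a rough model. The sketch below assumes independent failures for everything except one shared dependency (a region, provider or library) that every path relies on. The function name and the example figures are illustrative only.

```python
from math import prod


def parallel_with_shared_dependency(path_reliabilities, shared_reliability):
    """Reliability of parallel paths that all rely on one shared component.

    The shared component sits in series with the whole parallel group, so
    its reliability caps the result however many paths are added.
    """
    all_paths_fail = prod(1 - r for r in path_reliabilities)
    return shared_reliability * (1 - all_paths_fail)


# Two 99% paths behind a 99.5% shared dependency: about 0.9949,
# compared with about 0.9999 if the paths were fully independent.
shared = parallel_with_shared_dependency([0.99, 0.99], 0.995)
```

In this model, adding further paths never lifts the result above the reliability of the shared component. That is the case for separating regions, providers and designs rather than adding more copies.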
