Chapter 3 of 4
Distributed Data
Replication, partitioning, and the fundamental challenges of distributing data across multiple machines.
Key Insights
Replication and partitioning are the two main strategies for distributing data. You usually need both.
The CAP theorem is often misunderstood. It is really about the tradeoff between consistency and availability during a network partition: while the partition lasts, a system has to give up one of the two.
Distributed transactions are hard. Two-phase commit works but has serious performance and availability costs.
Notes
Replication Strategies
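Single-leader: One node accepts all writes and replicates them to followers. Simple, but the leader is a bottleneck for writes.
Multi-leader: Multiple nodes accept writes. Good for multi-datacenter setups, but concurrent writes to the same key on different leaders create conflicts that must be resolved.
Leaderless: Any node can accept reads and writes (Dynamo-style). With N replicas, each write waits for W acknowledgements and each read waits for R responses. Choosing R + W > N ensures that every read quorum shares at least one node with every write quorum.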
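The quorum condition can be checked mechanically. The sketch below is only an illustration: the function names are made up for these notes, and nodes are modeled as plain integers. It compares the R + W > N rule against a brute-force enumeration of every possible write set and read set.

```python
from itertools import combinations


def is_strict_quorum(n: int, w: int, r: int) -> bool:
    """True if any w-node write set must share a node with any r-node read set."""
    return r + w > n


def overlap_by_enumeration(n: int, w: int, r: int) -> bool:
    """Check the same property the slow way, by trying every pair of sets."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


if __name__ == "__main__":
    # (N, W, R) configurations; the two checks should always agree.
    for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 3)]:
        print(n, w, r, is_strict_quorum(n, w, r), overlap_by_enumeration(n, w, r))
```

The overlap only says that a read contacts at least one node that saw the latest acknowledged write. The reader still needs some way, such as version numbers, to tell which response is the newest.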
Partitioning (Sharding)
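Split data across multiple nodes so that each node handles a subset of the keys.
Key-range partitioning: Each partition owns a contiguous range of keys. Good for range queries, but risks hot spots when access concentrates on one range.
Hash partitioning: A hash of the key determines the partition. Distributes keys evenly, but a range query has to be sent to every partition (scatter-gather).
The hardest part is rebalancing when you add or remove nodes.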
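A minimal sketch of both schemes, plus a rough illustration of why rebalancing is hard. Everything here is an assumption made for these notes: the function names, the use of MD5 as the hash, and the naive "hash mod number of partitions" placement, which is chosen because it shows the problem clearly and not because it is recommended.

```python
import bisect
import hashlib


def range_partition(key: str, boundaries: list[str]) -> int:
    """Key-range scheme: sorted boundaries split the key space into ranges."""
    # Keys below boundaries[0] go to partition 0, and so on.
    return bisect.bisect_right(boundaries, key)


def hash_partition(key: str, num_partitions: int) -> int:
    """Hash scheme: a stable hash of the key, reduced modulo the partition count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def fraction_moved(keys: list[str], old_n: int, new_n: int) -> float:
    """Share of keys that land on a different partition after resizing."""
    moved = sum(
        1 for k in keys if hash_partition(k, old_n) != hash_partition(k, new_n)
    )
    return moved / len(keys)


if __name__ == "__main__":
    boundaries = ["h", "p"]  # three ranges: below "h", "h" up to "p", "p" and above
    for key in ["apple", "kiwi", "zebra"]:
        print(key, range_partition(key, boundaries), hash_partition(key, 4))

    keys = [f"user{i}" for i in range(1000)]
    print("moved when going from 4 to 5 partitions:", fraction_moved(keys, 4, 5))
```

With mod-N placement, going from 4 to 5 partitions relocates most keys, since a key stays put only when its hash gives the same remainder under both moduli. That is one reason real systems tend to use schemes such as a fixed number of partitions that are reassigned between nodes.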
The Dangers of Eventual Consistency
In eventually consistent systems, different nodes may return different values for the same key at the same time. This can lead to subtle bugs: a client might not see a write it has just made, causally related events might appear out of order, and concurrent writes might be silently lost.
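The "silently lost" case is easy to reproduce. The sketch below assumes last-write-wins (LWW) conflict resolution, a common policy in eventually consistent stores that these notes do not name explicitly. The Write class, the timestamps, and the cart values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    timestamp: float  # wall-clock time assigned by the replica that took the write
    value: str


def lww_merge(a: Write, b: Write) -> Write:
    """Last-write-wins: keep the write with the larger timestamp, drop the other."""
    return a if a.timestamp >= b.timestamp else b


if __name__ == "__main__":
    # Two clients update the same key on different replicas at almost the same time.
    on_replica_a = Write(timestamp=100.0, value="cart=[book]")
    on_replica_b = Write(timestamp=100.1, value="cart=[pen]")

    # When the replicas sync, only one write survives and no error is raised.
    print(lww_merge(on_replica_a, on_replica_b))
```

Both clients were told their write succeeded, yet the book never appears in the merged cart. Avoiding this requires detecting concurrency (for example with version vectors) and merging the values instead of picking one.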
Split-Brain Problem
In a single-leader setup, if the leader appears to fail, a follower gets promoted. But if the old leader is actually still alive and only cut off by a network partition, there are now two leaders accepting writes: a 'split brain.' This can cause data loss and corruption. Solutions include fencing tokens, which let the storage layer reject writes from a stale leader, and consensus algorithms, which make the nodes agree on a single leader.
Quotes
“The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or repair.”
— Page 275
“Network problems are surprisingly common, even in controlled datacenter environments.”
— Page 279