Chapter 3 of 4

Distributed Data

Replication, partitioning, and the fundamental challenges of distributing data across multiple machines.

Key Insights

💡KEY INSIGHT

Replication and partitioning are the two main strategies for distributing data. You usually need both.

💡KEY INSIGHT

The CAP theorem is often misunderstood — it's really about the tradeoff between consistency and availability during network partitions.

💡KEY INSIGHT

Distributed transactions are hard. Two-phase commit works but has serious performance and availability costs.

Notes

📘CONCEPT

Replication Strategies

Single-leader: One node accepts writes, replicates to followers. Simple but the leader is a bottleneck. Multi-leader: Multiple nodes accept writes. Good for multi-datacenter setups but introduces write conflicts. Leaderless: Any node can accept reads/writes (Dynamo-style). Uses quorum reads/writes (R + W > N) for consistency.

📘CONCEPT

Partitioning (Sharding)

Split data across multiple nodes so each handles a subset. Key-range partitioning: Good for range queries, risk of hot spots. Hash partitioning: Even distribution, but range queries require scatter-gather. The hardest part is rebalancing when you add or remove nodes.

⚠️WARNING

The Dangers of Eventual Consistency

In eventually consistent systems, different nodes may return different values for the same key at the same time. This can lead to subtle bugs: reading your own writes might fail, causally related events might appear out of order, and concurrent writes might be silently lost.

EXAMPLE

Split-Brain Problem

In a single-leader setup, if the leader appears to fail, a follower gets promoted. But if the old leader was actually still alive (network partition), you now have two leaders accepting writes — a 'split brain.' This can cause data loss and corruption. Solutions include fencing tokens and consensus algorithms.

Quotes

💬QUOTE
The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or repair.

— Page 275

💬QUOTE
Network problems are surprisingly common, even in controlled datacenter environments.

— Page 279