Chapter 1 of 4
Foundations of Data Systems
Reliability, scalability, and maintainability — the three pillars of good data system design.
Key Insights
A data-intensive application is one where data volume, complexity, or speed of change is the primary challenge — not raw CPU power.
Reliability means the system works correctly even when faults occur. Scalability means it handles growth gracefully. Maintainability means others can work on it productively.
There is no one-size-fits-all database. Understanding the tradeoffs between different tools is essential.
Notes
Reliability, Scalability, Maintainability
Reliability: The system should continue to work correctly even when things go wrong (hardware faults, software bugs, human errors).
Scalability: As the system grows (data volume, traffic, complexity), there should be reasonable ways to deal with that growth.
Maintainability: Over time, many people will work on the system, so it should be easy to understand, modify, and extend.
Describing Load and Performance
Describe load with load parameters that fit the system: requests per second, read/write ratio, cache hit rate, and so on. Describe performance with percentiles (p50, p95, p99) rather than averages. A high p99 latency means that 1 in 100 requests takes at least that long. The slowest requests often belong to the customers with the most data, who tend to be among the most valuable.
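As a rough illustration of load parameters, the sketch below derives three of them from raw counters collected over one measurement window. The LoadCounters class, its field names, and the sample numbers are assumptions made up for this example; they do not come from the book.

```python
from dataclasses import dataclass


@dataclass
class LoadCounters:
    """Hypothetical counters sampled over one measurement window."""
    window_seconds: float
    reads: int
    writes: int
    cache_hits: int
    cache_misses: int


def load_parameters(c: LoadCounters) -> dict:
    """Derive a few load parameters from the raw counters."""
    total_requests = c.reads + c.writes
    cache_lookups = c.cache_hits + c.cache_misses
    return {
        "requests_per_second": total_requests / c.window_seconds,
        # Reads per write; infinite if the window saw no writes at all.
        "read_write_ratio": c.reads / c.writes if c.writes else float("inf"),
        # Fraction of cache lookups that were served from the cache.
        "cache_hit_rate": c.cache_hits / cache_lookups if cache_lookups else 0.0,
    }


# Made-up numbers for a 60-second window.
counters = LoadCounters(window_seconds=60, reads=54_000, writes=6_000,
                        cache_hits=45_000, cache_misses=9_000)
params = load_parameters(counters)
print(params)
```

Which parameters matter depends on the architecture: a read-heavy service cares most about the hit rate, a write-heavy one about write throughput.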
Twitter's Fan-Out Problem
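Twitter faced a classic scalability challenge: when a user posts a tweet, it must appear in the home timelines of all their followers. Approach 1: look up and merge the tweets of everyone the user follows at read time (cheap writes, slow reads). Approach 2: pre-compute each follower's timeline at write time, known as fan-out on write (fast reads, expensive writes). Twitter's answer was a hybrid: fan-out on write for most users, with tweets from celebrities who have millions of followers fetched at read time and merged in.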
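The hybrid can be sketched in a few lines of in-memory Python. This is a toy model under stated assumptions, not Twitter's implementation: the TimelineService class, its method names, and the tiny celebrity_threshold are all hypothetical, and a user who follows someone late does not get that author's older tweets backfilled into the cache.

```python
from collections import defaultdict


class TimelineService:
    """Toy hybrid of fan-out on write and query at read time."""

    def __init__(self, celebrity_threshold):
        # Hypothetical cutoff: authors with at least this many followers
        # are treated as celebrities and skipped during fan-out.
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.tweets_by_author = defaultdict(list)  # author -> [(seq, author, text)]
        self.timeline_cache = defaultdict(list)    # user -> pre-computed tweets
        self.next_seq = 0

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post_tweet(self, author, text):
        self.next_seq += 1
        tweet = (self.next_seq, author, text)
        self.tweets_by_author[author].append(tweet)
        if not self.is_celebrity(author):
            # Fan-out on write: one cache insert per follower.
            for follower in self.followers[author]:
                self.timeline_cache[follower].append(tweet)

    def home_timeline(self, user):
        # Start from the pre-computed cache, then merge celebrity tweets
        # at read time. The set removes tweets that appear in both places.
        tweets = set(self.timeline_cache[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                tweets.update(self.tweets_by_author[author])
        return [(author, text) for _, author, text in sorted(tweets, reverse=True)]


service = TimelineService(celebrity_threshold=3)
service.follow("alice", "bob")
for fan in ("alice", "carol", "dave"):
    service.follow(fan, "star")

service.post_tweet("bob", "hello from bob")     # fanned out to alice's cache
service.post_tweet("star", "hello from star")   # stored once, merged at read time

timeline = service.home_timeline("alice")
print(timeline)
```

The trade-off is visible in post_tweet: an ordinary author costs one write per follower, while a celebrity costs a single write and pushes the work onto every read by their followers.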
Averages Are Misleading
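Don't use average response time to measure performance; use percentiles instead. The p50 (median) tells you what a typical request experiences: half of all requests are faster and half are slower. The p99 describes the tail: 99% of requests are faster and 1% are slower. A system can have a good-looking average and still have terrible tail latency, and because one user usually makes many requests, a slow tail reaches many users.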
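A small worked example makes the gap concrete. The latencies below are invented for illustration, and percentile is a simple nearest-rank helper written for this sketch rather than a library function.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


# Made-up response times in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = sorted([20] * 98 + [3000] * 2)

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  p50={p50_ms} ms  p95={p95_ms} ms  p99={p99_ms} ms")
```

Here the mean is 79.6 ms, a value no request actually experienced. The median shows that a typical request takes 20 ms, and the p99 exposes the 3000 ms tail that the mean smooths over.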
Quotes
“A fault is not the same as a failure. A fault is when one component deviates from its spec, whereas a failure is when the system as a whole stops providing the required service.”
— Page 6
“Even if you are working on a system that handles 'only' a few thousand requests per second, you should still think about its scalability.”
— Page 14