Chapter 1 of 4

Foundations of Data Systems

Reliability, scalability, and maintainability — the three pillars of good data system design.

Key Insights

💡KEY INSIGHT

A data-intensive application is one where data volume, complexity, or speed of change is the primary challenge — not raw CPU power.

💡KEY INSIGHT

Reliability means the system works correctly even when faults occur. Scalability means it handles growth gracefully. Maintainability means others can work on it productively.

💡KEY INSIGHT

There is no one-size-fits-all database. Understanding the tradeoffs between different tools is essential.

Notes

📘CONCEPT

Reliability, Scalability, Maintainability

Reliability: The system should continue to work correctly even when things go wrong (hardware faults, software bugs, human errors).

Scalability: As the system grows (data volume, traffic, complexity), there should be reasonable ways to deal with that growth.

Maintainability: Over time, many people will work on the system, so it should be easy to understand, modify, and extend.

📘CONCEPT

Describing Load and Performance

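Describe load with 'load parameters' such as requests per second, read/write ratio, or cache hit rate; the right ones depend on the system's architecture. Describe performance with percentiles (p50, p95, p99), not averages. A p99 latency of 1.5 seconds, for example, means 1 in 100 requests takes longer than 1.5 seconds. Those slow requests often come from your most valuable customers, because the accounts holding the most data tend to be the slowest to serve.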
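To make the percentile idea concrete, here is a minimal sketch that computes p50, p95, and p99 with the nearest-rank method. The helper name `percentile` and the sample response times are assumptions made up for illustration; production monitoring systems usually rely on streaming approximations rather than sorting every sample.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical response times in milliseconds (illustrative data only).
response_times_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]

p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
print(f"p50={p50} ms, p95={p95} ms, p99={p99} ms")
```

With only ten samples the p95 and p99 both land on the single slowest request, which is a reminder that tail percentiles need a large sample to be meaningful.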

✅EXAMPLE

Twitter's Fan-Out Problem

Twitter faced a classic scalability challenge: when a user posts a tweet, it needs to appear in the timelines of all their followers. Approach 1: Query at read time (cheap writes, slow reads). Approach 2: Pre-compute timelines at write time, known as fan-out on write (fast reads, expensive writes for users with many followers). Twitter uses a hybrid: fan-out for most users, but query-time for celebrities with millions of followers.
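The hybrid approach can be sketched in a few lines. This is an in-memory toy under stated assumptions, not Twitter's implementation: the class `TimelineService`, its method names, and the `celebrity_threshold` value are all hypothetical, and a real system would use far larger thresholds and persistent storage.

```python
from collections import defaultdict

class TimelineService:
    """Toy hybrid of fan-out on write and query at read time."""

    def __init__(self, celebrity_threshold):
        # Hypothetical cutoff: authors with at least this many followers
        # are treated as celebrities and skipped during fan-out.
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.tweets_by_author = defaultdict(list) # author -> [(seq, author, text)]
        self.timeline_cache = defaultdict(list)   # user -> [(seq, author, text)]
        self._seq = 0                             # monotonically increasing tweet id

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.celebrity_threshold

    def post_tweet(self, author, text):
        self._seq += 1
        entry = (self._seq, author, text)
        self.tweets_by_author[author].append(entry)
        if not self.is_celebrity(author):
            # Fan-out on write: push the tweet into every follower's cache.
            for user in self.followers[author]:
                self.timeline_cache[user].append(entry)

    def home_timeline(self, user):
        # Start from the precomputed cache, then merge celebrity tweets
        # at read time. A set removes duplicates for authors who became
        # celebrities after some of their tweets were already fanned out.
        entries = set(self.timeline_cache[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                entries.update(self.tweets_by_author[author])
        newest_first = sorted(entries, reverse=True)
        return [(author, text) for _, author, text in newest_first]

svc = TimelineService(celebrity_threshold=3)
svc.follow("alice", "bob")
for fan in ("alice", "carol", "dave"):
    svc.follow(fan, "star")

svc.post_tweet("bob", "hello")      # fanned out to alice's cache
svc.post_tweet("star", "big news")  # celebrity: stored once, merged on read

print(svc.home_timeline("alice"))
```

The write for "star" touches one list no matter how many followers exist, while the write for "bob" costs one cache append per follower; the read path pays a small merge cost only for the celebrities a user follows.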

⚠️WARNING

Averages Are Misleading

Don't rely on average response time to measure performance; use percentiles instead. The p50 (median) shows what a typical request experiences. The p99 shows how slow the slowest 1% of requests are, which describes the tail rather than the absolute worst case. A system can have a great average and still have terrible tail latency that affects many users.
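A small numeric illustration of the warning: the two latency distributions below are invented so that they share the same mean, yet their tails differ by an order of magnitude. The function name `summarize` and the datasets are assumptions for this sketch; `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

def summarize(latencies_ms):
    """Return the mean, median, and 99th percentile of a list of latencies."""
    cut_points = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {
        "mean": statistics.mean(latencies_ms),
        "p50": statistics.median(latencies_ms),
        "p99": cut_points[98],  # the 99th cut point
    }

# Hypothetical data: 100 requests each, both averaging 100 ms.
steady = [100] * 100             # every request takes 100 ms
spiky = [50] * 95 + [1050] * 5   # most are fast, a few are very slow

steady_stats = summarize(steady)
spiky_stats = summarize(spiky)

print("steady:", steady_stats)
print("spiky: ", spiky_stats)
```

Both services report a 100 ms average, but only the percentiles reveal that 5% of the requests to the second service take over a second.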

Quotes

💬QUOTE

“A fault is not the same as a failure. A fault is when one component deviates from its spec, whereas a failure is when the system as a whole stops providing the required service.”

— Page 6

💬QUOTE

“Even if you are working on a system that handles 'only' a few thousand requests per second, you should still think about its scalability.”

— Page 14