Chapter 4 of 4
Derived Data and Stream Processing
Batch processing, stream processing, and how to build reliable data pipelines.
Key Insights
The distinction between 'system of record' and 'derived data' is fundamental to good data architecture.
Batch processing (MapReduce) and stream processing (Kafka, Flink) are complementary approaches to transforming data.
Event sourcing and change data capture turn your database into a stream, enabling powerful derived views.
Notes
Systems of Record vs Derived Data
A system of record (source of truth) holds authoritative data — each fact is represented exactly once. Derived data (caches, indexes, materialized views) is the result of transforming the source data. If you lose derived data, you can recreate it from the source. This distinction simplifies architecture and debugging.
MapReduce and Batch Processing
MapReduce processes large datasets by: (1) Map — extract key-value pairs from input. (2) Sort — group by key. (3) Reduce — aggregate values per key. The key insight: by reading input and writing output (no side effects), jobs are retryable and composable. Hadoop popularized this; Spark improved on it with in-memory processing.
Stream Processing with Event Logs
An event log (like Kafka) is an append-only, ordered sequence of events. Producers write events; consumers read them. Unlike message queues, consumers can re-read events. This enables: decoupled microservices, event sourcing, change data capture, and real-time analytics.
Change Data Capture (CDC)
CDC captures every change made to a database and streams it as events. This lets you: keep caches in sync, update search indexes in real-time, replicate data to a data warehouse, and trigger downstream processing. Tools: Debezium, Maxwell, Kafka Connect. It turns your database's write-ahead log into a stream.
Quotes
“In a sense, batch processing is a special case of stream processing — a stream with a finite, known input.”
— Page 451
“The derived data approach — transforming data from one form to another — is so broadly applicable that it pops up in many different areas of computing.”
— Page 499