Chapter 4 of 4

Derived Data and Stream Processing

Batch processing, stream processing, and how to build reliable data pipelines.

Key Insights

💡KEY INSIGHT

The distinction between 'system of record' and 'derived data' is fundamental to good data architecture.

💡KEY INSIGHT

Batch processing (MapReduce) and stream processing (Kafka, Flink) are complementary approaches to transforming data.

💡KEY INSIGHT

Event sourcing and change data capture turn your database into a stream, enabling powerful derived views.

Notes

📘CONCEPT

Systems of Record vs Derived Data

A system of record (source of truth) holds authoritative data — each fact is represented exactly once. Derived data (caches, indexes, materialized views) is the result of transforming the source data. If you lose derived data, you can recreate it from the source. This distinction simplifies architecture and debugging.
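
As a minimal sketch of the point above (the `users` data and `build_email_index` function are hypothetical names chosen for illustration), the derived index below can be thrown away and rebuilt from the source at any time:

```python
# Hypothetical system of record: each fact (a user's email) is stored exactly once.
users = {
    1: {"name": "Ada", "email": "ada@example.com"},
    2: {"name": "Grace", "email": "grace@example.com"},
}


def build_email_index(source):
    """Derive an email -> user id lookup from the source data."""
    return {record["email"]: user_id for user_id, record in source.items()}


# Derived data: redundant with the source, but faster for lookups by email.
email_index = build_email_index(users)

# Losing the derived data is recoverable: rerun the same transformation.
email_index = None
email_index = build_email_index(users)
```

The reverse is not true: if `users` were lost, the index alone could not restore the names.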

📘CONCEPT

MapReduce and Batch Processing

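MapReduce processes large datasets in three steps: (1) Map: extract key-value pairs from each input record. (2) Sort: bring all pairs with the same key together (often called the shuffle). (3) Reduce: aggregate the values for each key. The key insight: a job only reads its input and writes new output, with no other side effects, so a failed task can safely be retried and the output of one job can become the input of the next. Hadoop popularized this model; Spark improved on it by keeping intermediate data in memory where possible instead of writing it to disk between stages.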
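
The three steps can be sketched as a single-process word count. This is an illustration of the programming model only, not of Hadoop's or Spark's APIs, and the function names are assumptions made for this example:

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def sort_phase(pairs):
    """Sort: order pairs by key so equal keys become adjacent."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Reduce: sum the values for each key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))


def word_count(lines):
    # Pure function of its input: rerunning it after a failure is safe.
    return dict(reduce_phase(sort_phase(map_phase(lines))))


counts = word_count(["the quick fox", "the lazy dog", "the end"])
```

Because `word_count` never modifies its input, its result can be fed directly into another job of the same shape.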

📘CONCEPT

Stream Processing with Event Logs

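An event log (like Kafka) is an append-only, ordered sequence of events (in Kafka, ordering is guaranteed within a partition). Producers append events; consumers read them at their own pace. Unlike traditional message queues, which typically delete a message once it has been acknowledged, a log retains events, so consumers can re-read them from an earlier offset. This enables: decoupled microservices, event sourcing, change data capture, and real-time analytics.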
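
A toy in-memory log makes the re-read property concrete. The `EventLog` class below is a hypothetical sketch, not Kafka's client API; it only shows that reading does not consume events:

```python
class EventLog:
    """Append-only list of events; each event's offset is its list index."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Add an event at the end and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset=0):
        """Return all events from the given offset onward, without removing them."""
        return self._events[offset:]


log = EventLog()
log.append({"type": "order_placed", "order_id": 1})
log.append({"type": "order_shipped", "order_id": 1})

first_read = log.read(0)  # a consumer reads everything
replay = log.read(0)      # a second read from offset 0 sees the same events
tail = log.read(1)        # another consumer resumes from a later offset
```

Each consumer only needs to remember its own offset, which is what lets independent consumers proceed at different speeds.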

TIP

Change Data Capture (CDC)

CDC captures every change made to a database and streams it as events. This lets you: keep caches in sync, update search indexes in near real time, replicate data to a data warehouse, and trigger downstream processing. Tools: Debezium, Maxwell, Kafka Connect. Log-based CDC effectively turns the database's replication log (such as a write-ahead log) into a stream that other systems can consume.
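
On the consuming side, keeping a cache in sync amounts to applying each change event in order. The event shape below (`op`, `key`, `after`) is a simplified assumption made for this sketch and is not the actual format emitted by Debezium or Maxwell:

```python
def apply_change(cache, change):
    """Apply one change event to a derived key-value cache."""
    op, key = change["op"], change["key"]
    if op in ("insert", "update"):
        cache[key] = change["after"]  # store the row's new state
    elif op == "delete":
        cache.pop(key, None)          # tolerate deletes for unknown keys
    else:
        raise ValueError(f"unknown operation: {op}")
    return cache


# Hypothetical change stream, in the order the database committed the writes.
changes = [
    {"op": "insert", "key": 1, "after": {"name": "Ada"}},
    {"op": "update", "key": 1, "after": {"name": "Ada Lovelace"}},
    {"op": "insert", "key": 2, "after": {"name": "Grace"}},
    {"op": "delete", "key": 2, "after": None},
]

cache = {}
for change in changes:
    apply_change(cache, change)
```

Applying events in commit order matters: swapping the insert and update for key 1 would leave the cache holding stale data.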

Quotes

💬QUOTE
In a sense, batch processing is a special case of stream processing — a stream with a finite, known input.

— Page 451

💬QUOTE
The derived data approach — transforming data from one form to another — is so broadly applicable that it pops up in many different areas of computing.

— Page 499