2026-06-09T00:19:03Z by Showboat 0.6.1
A library data platform, pattern by pattern through the data-engineering canon.
chimpy-lake is the data platform behind our library’s analytics — it pulls data from vendors like OverDrive and our Sierra ILS, refines it, and serves it. This walkthrough is a tour of how it’s built, told through the established data-engineering playbook. Every part you’ll see has a standard name and a decades-old rationale; none of it is bespoke cleverness — and that is exactly the claim.
Each beat teaches the pattern in plain language, then shows
the real code running over a small frozen sample. You don’t
need a data-engineering background to follow it — that’s the point. The
commands are real and reproducible (showboat verify re-runs
them); the data is a public fixture with no patron
data.
What. ETL = Extract, Transform, Load. ELT swaps the last two: we load the raw vendor data first, exactly as it arrives, then transform it downstream.
Why it matters. The original is never lost, reprocessing is cheap, and the vendor can change their format without breaking our load. We keep the truth and derive from it.
Term to know: schema-on-read — store now, impose structure when you query.
How we do it — one raw OverDrive page, landed verbatim:
uv run python docs/demos/by-the-book/_driver.py elt
Pattern — ELT: Extract, LOAD, then Transform
============================================
raw OverDrive page — top-level keys: ['nextPageUrl', 'totalCheckouts', 'checkouts']
checkouts in this page: 6
one checkout record, exactly as OverDrive sent it (19 fields):
checkoutId 'c-0001'
branch 'Main Library'
checkoutDateUtc '06/04/2026 09:14:00'
lendingPeriodDays 21
borrowedFrom 'Libby'
isRenewal False
userId 'u-100001'
boughtBy 'Consortium'
titleId 5550001
reserveId '00000000-0000-0000-0000-0000000000a1'
title 'The Midnight Library'
edition 'Unabridged'
creator 'Matt Haig'
isbn '9780525559474'
language '{"LanguageCode":"en","Name":"English"}'
audienceRating 'Adult Fiction'
subjects ['Fiction', 'Fantasy']
format 'eBook'
publisher 'Viking'
Nothing is reshaped at load time — the original is always recoverable, and we
can re-derive everything below from it. (Patron ids are synthetic in this fixture.)
What. Every data source (a “tenant”) ships a manifest declaring its shape and schedule, and a validator enforces that contract — a source that doesn’t satisfy it can’t join the lake.
Why it matters. Bad inputs are caught at the door, in CI — not at 3 a.m. in production. The contract is a single, checkable definition of “what a valid tenant is.”
Term to know: data contract / contract testing.
How we do it — one valid manifest passes; one bad manifest is refused with every violation at once:
uv run python docs/demos/by-the-book/_driver.py contract
Pattern — Data contracts: a tenant must satisfy its manifest to exist
=====================================================================
PASS good_tenant.toml satisfies the Tier-2 contract:
tenant.name overdrive-checkouts
tenant.kind extract
tenant.schema overdrive_raw
telemetry.sources ['overdrive-checkouts']
default_schedule OnCalendar=*-*-* 07,18:05:00 UTC
FAIL bad_tenant.toml is REFUSED — 4 contract violation(s), all at once:
tenant.name: must match ^[a-z][a-z0-9-]*$ (got 'BadName')
tenant.kind: must be one of ['extract', 'report'] (got 'warehouse')
runtime.default_schedule: must start with 'OnCalendar=' (got 'daily')
telemetry.emit_runs: must be bool (got str)
The contract is enforced in the conformance test suite, not at runtime — a bad
manifest fails CI, it never crashes production.
What. Data flows through named layers: bronze (raw) → silver (cleaned, typed, deduplicated) → gold (aggregated, query-ready “marts”). Each layer is more refined than the last.
Why it matters. Analysts read gold and trust it; engineers debug against bronze. Responsibilities are separated, and you never re-clean the same data twice.
Term to know: medallion architecture; a
dimension (dim_title) is a reusable lookup
table; a mart is a purpose-built summary.
How we do it — the real dbt SQL, run end-to-end over one frozen page:
uv run python docs/demos/by-the-book/_driver.py medallion
Pattern — Medallion architecture: raw -> silver -> gold
=======================================================
BRONZE (raw) 6 records, vendor shape: 19 camelCase fields, ISBNs as text
SILVER 6 rows, 17 typed snake_case columns, deduped on checkout_id
GOLD dim_title (6 titles) + query-ready marts
GOLD — dim_title (a conformed dimension, one row per title):
title_id audience format title
5550001 Adult Fiction eBook The Midnight Library
5550002 Adult Fiction Audiobook Tomorrow, and Tomorrow, and Tomorrow
5550003 Juvenile Fiction eBook The One and Only Ivan
5550004 Juvenile Fiction Audiobook Wonder
5550005 Young Adult Fiction eBook The Hate U Give
5550006 Adult Fiction eBook Project Hail Mary
GOLD — daily_by_audience (a mart: checkouts per day per audience):
day audience checkouts users
2026-06-04 Adult Fiction 3 3
2026-06-04 Juvenile Fiction 2 2
2026-06-04 Young Adult Fiction 1 1
This is the SAME dbt SQL that runs in production (there over ~59,770 deduped
checkouts); here it runs end-to-end over one frozen page.
What. A load is idempotent when running it twice has the same effect as running it once. chimpy-lake gets this two ways: it deduplicates within a load, and on re-runs an incremental anti-join inserts only genuinely new rows.
Why it matters. Pipelines crash and get retried. Idempotency means a retry can never double-count or corrupt — the data converges to the same answer every time.
Term to know: idempotency, incremental load, and SCD Type 0 (a landed snapshot is never edited — history is append-only).
How we do it — load the same page twice; the second pass adds nothing:
uv run python docs/demos/by-the-book/_driver.py idempotent
Pattern — Idempotent + incremental loads (SCD Type 0)
=====================================================
raw page: 6 rows, checkout_ids = ['d-1', 'd-1', 'd-2', 'd-3', 'd-3', 'd-4']
distinct checkout_ids: 4
LOAD #1 silver dedup (qualify row_number() over (partition by checkout_id)): 4 rows
LOAD #2 same page again, incremental anti-join: 0 new rows inserted
table row count after the re-run: 4 (unchanged)
Dedup collapses duplicate checkout_ids within a load; the anti-join means
re-running the SAME data inserts nothing. Re-runs converge — that is idempotent.
And history is append-only (snapshot-at-ingest, SCD Type 0): a landed row is
never updated in place, so the lake is a faithful record of what was seen.
What. Assertions that run on every load — is the row count sane, are key columns non-null and unique, did the schema drift? Like unit tests, but for data.
Why it matters. Silent data corruption is the worst kind: reports look fine while being wrong. These checks make bad data loud. And they’re never-fatal — a failure is recorded as a signal; the load still completes.
Term to know: data quality / assertions (cf. dbt tests, Great Expectations); a freshness SLO is the “how stale is too stale” line.
How we do it — a clean page passes; a defective page fails, loudly:
uv run python docs/demos/by-the-book/_driver.py quality
Pattern — Data quality as code: assertions that run on every load
=================================================================
GOOD page (6 records):
volume PASS 6 ingested (min —)
key_not_null PASS 0 null of 6
key_unique PASS 0 duplicate of 6
schema PASS baseline: 19 columns
BAD page (6 records, defects injected):
volume PASS 6 ingested (min —)
key_not_null FAIL 1 null of 6
key_unique FAIL 1 duplicate of 6
schema DRIFT removed ['publisher'] (18 vs 19 columns)
Volume also catches count anomalies against history, e.g. a partial fetch:
volume FAIL 412 vs prior 6740 — 93.9% delta > 20%
Nothing here raised; every result is a row the platform can act on later.
What. Every run — success or failure — writes a row
to an append-only ledger (_platform.runs).
From that one ledger the platform computes each tenant’s health and
presents a fleet-wide view
(chimpy-lake status --by-tenant).
Why it matters. One operator can see the health of every pipeline at a glance, and the system explains itself — no separate monitoring to wire up. The ledger is the single source of truth for “what ran, when, and did it work.”
Term to know: observability, control plane, run ledger.
How we do it — the real telemetry schema + health view, over a seeded ledger:
uv run python docs/demos/by-the-book/_driver.py observability
Pattern — Observability + the control plane (the run ledger)
============================================================
the run ledger — every run, every tenant (append-only):
source status records started_at
catalog succeeded 2,080,000 2026-06-08 03:00:00+00:00
circ-trans succeeded 46,813,744 2026-06-08 06:30:00+00:00
circ-trans failed — 2026-06-08 11:00:00+00:00
overdrive-checkouts succeeded 6,123 2026-06-07 18:05:00+00:00
overdrive-checkouts succeeded 6,740 2026-06-08 07:05:00+00:00
reading-lists succeeded 174 2026-05-31 04:00:00+00:00
the fleet at a glance — `chimpy-lake status --by-tenant` reads _platform.tenant_health:
source env health since last_error
catalog prod ok 9h
circ-trans prod last_run_failed 6h ParserException: syntax error near ';'
overdrive-checkouts prod ok 5h
reading-lists prod stale 200h
The view classifies each tenant (ok / stale / last_run_failed / never) from the
ledger against its SLO. schema_version=2, built by the real
forward-only migrations. (now() pinned to 2026-06-08 12:00 UTC for reproducibility.)
Every command above is reproducible — showboat verify
re-runs each one and diffs its output. And the on-thesis invariant (the
idempotent, incremental silver dedup) plus the contract and quality
assertions ship green:
uv run pytest tests/demos/test_by_the_book.py -q 2>&1 | sed -E 's/ in [0-9.]+s//'... [100%]
3 passed
Two more chapters of the playbook are designed but deliberately gated, posed here as questions rather than claims:
Nothing new under the lake — and that’s the point. The whole platform is the standard data-engineering playbook, implemented for a real library, and still running. Everything above re-runs on demand.
Rendered from chimpy-lake 4152989 on 2026-06-09 · showboat verify: reproduces. A living artifact — the version ledger is git.