chimpy-lake, by the book

2026-06-18T23:14:05Z by Showboat 0.6.1

chimpy-lake, by the book

A library data platform, pattern by pattern through the data-engineering canon.

chimpy-lake is the data platform behind our library’s analytics — it pulls data from vendors like OverDrive and our Sierra ILS, refines it, and serves it. This walkthrough is a tour of how it’s built, told through the established data-engineering playbook. Every part you’ll see has a standard name and a decades-old rationale; none of it is bespoke cleverness — and that is exactly the claim.

Each beat teaches the pattern in plain language, then shows the real code running over a small frozen sample. You don’t need a data-engineering background to follow it — that’s the point. The commands are real and reproducible (showboat verify re-runs them); the data is a public fixture with no patron data.

1 · ELT, not ETL

What. ETL = Extract, Transform, Load. ELT swaps the last two: we load the raw vendor data first, exactly as it arrives, then transform it downstream.

Why it matters. The original is never lost, reprocessing is cheap, and the vendor can change their format without breaking our load. We keep the truth and derive from it.

Term to know: schema-on-read — store now, impose structure when you query.

How we do it — one raw OverDrive page, landed verbatim:

uv run python docs/demos/by-the-book/_driver.py elt


Pattern — ELT: Extract, LOAD, then Transform
============================================
raw OverDrive page — top-level keys: ['nextPageUrl', 'totalCheckouts', 'checkouts']
  checkouts in this page: 6

one checkout record, exactly as OverDrive sent it (19 fields):
  checkoutId         'c-0001'
  branch             'Main Library'
  checkoutDateUtc    '06/04/2026 09:14:00'
  lendingPeriodDays  21
  borrowedFrom       'Libby'
  isRenewal          False
  userId             'u-100001'
  boughtBy           'Consortium'
  titleId            5550001
  reserveId          '00000000-0000-0000-0000-0000000000a1'
  title              'The Midnight Library'
  edition            'Unabridged'
  creator            'Matt Haig'
  isbn               '9780525559474'
  language           '{"LanguageCode":"en","Name":"English"}'
  audienceRating     'Adult Fiction'
  subjects           ['Fiction', 'Fantasy']
  format             'eBook'
  publisher          'Viking'

Nothing is reshaped at load time — the original is always recoverable, and we
can re-derive everything below from it. (Patron ids are synthetic in this fixture.)

2 · Data contracts

What. Every data source (a “tenant”) ships a manifest declaring its shape and schedule, and a validator enforces that contract — a source that doesn’t satisfy it can’t join the lake.

Why it matters. Bad inputs are caught at the door, in CI — not at 3 a.m. in production. The contract is a single, checkable definition of “what a valid tenant is.”

Term to know: data contract / contract testing.

How we do it — one valid manifest passes; one bad manifest is refused with every violation at once:

uv run python docs/demos/by-the-book/_driver.py contract


Pattern — Data contracts: a tenant must satisfy its manifest to exist
=====================================================================
PASS  good_tenant.toml satisfies the Tier-2 contract:
  tenant.name        overdrive-checkouts
  tenant.kind        extract
  tenant.schema      overdrive_raw
  telemetry.sources  ['overdrive-checkouts']
  default_schedule   OnCalendar=*-*-* 07,18:05:00 UTC

FAIL  bad_tenant.toml is REFUSED — 4 contract violation(s), all at once:
  tenant.name: must match ^[a-z][a-z0-9-]*$ (got 'BadName')
  tenant.kind: must be one of ['extract', 'report'] (got 'warehouse')
  runtime.default_schedule: must start with 'OnCalendar=' (got 'daily')
  telemetry.emit_runs: must be bool (got str)

The contract is enforced in the conformance test suite, not at runtime — a bad
manifest fails CI, it never crashes production.

3 · Medallion architecture

What. Data flows through named layers: bronze (raw) → silver (cleaned, typed, deduplicated) → gold (aggregated, query-ready “marts”). Each layer is more refined than the last.

Why it matters. Analysts read gold and trust it; engineers debug against bronze. Responsibilities are separated, and you never re-clean the same data twice.

Term to know: medallion architecture; a dimension (dim_title) is a reusable lookup table; a mart is a purpose-built summary.

How we do it — the real dbt SQL, run end-to-end over one frozen page:

uv run python docs/demos/by-the-book/_driver.py medallion


Pattern — Medallion architecture: raw -> silver -> gold
=======================================================
  BRONZE (raw)   6 records, vendor shape: 19 camelCase fields, ISBNs as text
  SILVER         6 rows, 17 typed snake_case columns, deduped on checkout_id
  GOLD           dim_title (6 titles) + query-ready marts

GOLD — dim_title (a conformed dimension, one row per title):
  title_id  audience             format     title
  5550001   Adult Fiction        eBook      The Midnight Library
  5550002   Adult Fiction        Audiobook  Tomorrow, and Tomorrow, and Tomorrow
  5550003   Juvenile Fiction     eBook      The One and Only Ivan
  5550004   Juvenile Fiction     Audiobook  Wonder
  5550005   Young Adult Fiction  eBook      The Hate U Give
  5550006   Adult Fiction        eBook      Project Hail Mary

GOLD — daily_by_audience (a mart: checkouts per day per audience):
  day          audience             checkouts  users
  2026-06-04   Adult Fiction                3      3
  2026-06-04   Juvenile Fiction             2      2
  2026-06-04   Young Adult Fiction          1      1

This is the SAME dbt SQL that runs in production (there over ~59,770 deduped
checkouts); here it runs end-to-end over one frozen page.

4 · Idempotent + incremental

What. A load is idempotent when running it twice has the same effect as running it once. chimpy-lake gets this two ways: it deduplicates within a load, and on re-runs an incremental anti-join inserts only genuinely new rows.

Why it matters. Pipelines crash and get retried. Idempotency means a retry can never double-count or corrupt — the data converges to the same answer every time.

Term to know: idempotency, incremental load, and SCD Type 0 (a landed snapshot is never edited — history is append-only).

How we do it — load the same page twice; the second pass adds nothing:

uv run python docs/demos/by-the-book/_driver.py idempotent


Pattern — Idempotent + incremental loads (SCD Type 0)
=====================================================
raw page: 6 rows, checkout_ids = ['d-1', 'd-1', 'd-2', 'd-3', 'd-3', 'd-4']
  distinct checkout_ids: 4

LOAD #1  silver dedup (qualify row_number() over (partition by checkout_id)): 4 rows
LOAD #2  same page again, incremental anti-join: 0 new rows inserted
         table row count after the re-run: 4  (unchanged)

Dedup collapses duplicate checkout_ids within a load; the anti-join means
re-running the SAME data inserts nothing. Re-runs converge — that is idempotent.
And history is append-only (snapshot-at-ingest, SCD Type 0): a landed row is
never updated in place, so the lake is a faithful record of what was seen.

5 · Data quality as code

What. Assertions that run on every load — is the row count sane, are key columns non-null and unique, did the schema drift? Like unit tests, but for data.

Why it matters. Silent data corruption is the worst kind: reports look fine while being wrong. These checks make bad data loud. And they’re never-fatal — a failure is recorded as a signal; the load still completes.

Term to know: data quality / assertions (cf. dbt tests, Great Expectations); a freshness SLO is the “how stale is too stale” line.

How we do it — a clean page passes; a defective page fails, loudly:

uv run python docs/demos/by-the-book/_driver.py quality


Pattern — Data quality as code: assertions that run on every load
=================================================================
GOOD page (6 records):
  volume           PASS  6 ingested (min —)
  key_not_null     PASS  0 null of 6
  key_unique       PASS  0 duplicate of 6
  schema           PASS  baseline: 19 columns

BAD page (6 records, defects injected):
  volume           PASS  6 ingested (min —)
  key_not_null     FAIL  1 null of 6
  key_unique       FAIL  1 duplicate of 6
  schema           DRIFT removed ['publisher'] (18 vs 19 columns)

Volume also catches count anomalies against history, e.g. a partial fetch:
  volume           FAIL  412 vs prior 6740 — 93.9% delta > 20%

Nothing here raised; every result is a row the platform can act on later.

6 · Observability + the control plane

What. Every run — success or failure — writes a row to an append-only ledger (_platform.runs). From that one ledger the platform computes each tenant’s health and presents a fleet-wide view (chimpy-lake status --by-tenant).

Why it matters. One operator can see the health of every pipeline at a glance, and the system explains itself — no separate monitoring to wire up. The ledger is the single source of truth for “what ran, when, and did it work.”

Term to know: observability, control plane, run ledger.

How we do it — the real telemetry schema + health view, over a seeded ledger:

uv run python docs/demos/by-the-book/_driver.py observability


Pattern — Observability + the control plane (the run ledger)
============================================================
the run ledger — every run, every tenant (append-only):
  source               status        records  started_at
  catalog              succeeded   2,080,000  2026-06-08 03:00:00+00:00
  circ-trans           succeeded  46,813,744  2026-06-08 06:30:00+00:00
  circ-trans           failed              —  2026-06-08 11:00:00+00:00
  overdrive-checkouts  succeeded       6,123  2026-06-07 18:05:00+00:00
  overdrive-checkouts  succeeded       6,740  2026-06-08 07:05:00+00:00
  reading-lists        succeeded         174  2026-05-31 04:00:00+00:00

the fleet at a glance — `chimpy-lake status --by-tenant` reads _platform.tenant_health:
  source               env   health            since  last_error
  catalog              prod  ok                   9h  
  circ-trans           prod  last_run_failed      6h  ParserException: syntax error near ';'
  overdrive-checkouts  prod  ok                   5h  
  reading-lists        prod  stale              200h  

The view classifies each tenant (ok / stale / last_run_failed / never) from the
ledger against its SLO. schema_version=3, built by the real
forward-only migrations. (now() pinned to 2026-06-08 12:00 UTC for reproducibility.)

Proof

Every command above is reproducible — showboat verify re-runs each one and diffs its output. And the on-thesis invariant (the idempotent, incremental silver dedup) plus the contract and quality assertions ship green:

uv run pytest tests/demos/test_by_the_book.py -q 2>&1 | sed -E 's/ in [0-9.]+s//'

...                                                                      [100%]
3 passed

Glossary

ELT / schema-on-read — load raw first, transform later; impose structure at query time. (Beat 1)
Data contract / contract testing — a tenant’s declared, validated shape; it must pass to join the lake. (Beat 2)
Medallion (bronze/silver/gold) — layered refinement from raw to query-ready. A dimension is a reusable lookup; a mart is a purpose-built summary. (Beat 3)
Idempotency — running a load twice has the same effect as running it once. (Beat 4)
Incremental load — process only new/changed rows, not everything. (Beat 4)
SCD Type 0 — a landed snapshot is never updated; history is append-only. (Beat 4)
Data quality / assertions — automated checks on each load: volume, null, unique, schema drift. (Beat 5)
Freshness SLO — the agreed limit on how stale data may get. (Beat 5)
Observability / control plane / run ledger — the platform records every run and computes its own health from that ledger. (Beat 6)

Where the canon goes next — open questions, not commitments

Two more chapters of the playbook are designed but deliberately gated, posed here as questions rather than claims:

Data mesh / federation. The platform is built so each domain owns its data, and a read-only endpoint could federate one library’s catalog to a peer. Not built — there’s no second instance yet.
Governance: CQRS, tokenization, k-anonymity. A separate, policy-gated initiative handles patron data with a read/write split, a vault that tokenizes identities, and suppression of small cells (k-anonymity). It is kept entirely off this public surface — named here, never run — pending the records-officer and counsel conversation.

Nothing new under the lake — and that’s the point. The whole platform is the standard data-engineering playbook, implemented for a real library, and still running. Everything above re-runs on demand.

← all walkthroughs · Rendered from 226199c on 2026-06-18 · showboat verify: reproduces. A living artifact — the version ledger is git.