chimpy-lake showboats

Narrated, reproducible walkthroughs of working code from the chimpy-lake data platform. Each one shows real commands and their captured output; each is a living artifact — always the current render of current code.

Walkthroughs

What did the catalog say last Tuesday? published · 2026-06-18
bitemporal bronze — keep every nightly snapshot as churn, reconstruct any past state exactly
The sierra-snapshot tenant keeps a bitemporal bronze: every nightly Sierra snapshot is set-diffed against the prior state and stored as just the churn — only what moved — each row stamped with the epoch it opened and closed. This runs the real engine: the pure diff classifier, three nights ingested into the envelope (stored as base + churn, not full copies), an as_of() time-travel that reconstructs each night exactly, and the safety rails — idempotent re-runs and strictly forward epochs — that make the timeline impossible to corrupt.
bitemporal · time-travel · snapshot · sierra-snapshot · bronze · SCD Type 2 · epoch · diff · churn · idempotency · monotonic epochs · DuckDB · tenants · chimpy-lake
Scaffolding a tenant: from cookiecutter to conformance published · 2026-06-18
one command turns nothing into a tenant the platform already trusts
Related: One lake, many tenants: the control plane
Drive the real cookiecutter to mint a tenant, validate its Tier-3 manifest + version provenance, and run the live conformance gate green — onboarding as generation plus a gate.
scaffold · cookiecutter · tenants · manifest · provenance · conformance · lockstep · plugin contract
One lake, many tenants: the control plane published · 2026-06-18
how a tenant declares itself, inherits a lifecycle, and is driven by one operator CLI
Related: chimpy-lake, by the book
The platform claim made concrete: each data source is a tenant that declares itself in a manifest, inherits a uniform lifecycle, and is installed + run from one control-plane CLI — idempotently and safely, over a small no-PII fixture.
tenants · lifecycle · control plane · manifest · data contracts · tiers · instance registry · idempotency · desired-state reconciliation · Quadlet · run-now
Find books at the right level published · 2026-06-18
the payoff — browse a curated reading list by grade band, estimates marked and inspectable
Related: What grade is this book?
The companion to "What grade is this book?": where that one extracted reading levels honestly, this one is the patron payoff — turning scattered, multi-scale measures (Lexile / ATOS / Fountas & Pinnell) into one browsable grade band per book. A real catalogued grade shows solid; a converted or AI-estimated band shows faint with an ⓘ that links to how it was made. AI estimates are included in the browse facet (marked), so the surface stays useful without ever passing an estimate off as the catalog's word.
reading levels · grade band · browse by level · Lexile · ATOS · Fountas and Pinnell · native-first consensus · AI estimate · self-consistency · local model · abstain · transparency · faceted browse · Datasette · honest data · chimpy-reader · Newbery
Did we keep your catalog intact? published · 2026-06-18
the parse-fidelity invariant — proving, byte-for-byte, that the lake's copy IS the record
Related: chimpy-lake, by the book
A data lake is only trustworthy if its copy of a record is the record. This runs the platform's standing parse-fidelity check for real: it reads each record's field tags straight from the raw ISO-2709 bytes and proves they match the stored marc_json, catches a dropped field by naming the exact bib, is honest about which of its four facets actually proves fidelity, and treats a corrupt record as a counted failure — never a crash that could hide it.
parse fidelity · MARC · ISO-2709 · catalog · data quality · never-fatal · byte-for-byte · ingest invariant · tenants · chimpy-lake
What grade is this book? published · 2026-06-17
reading levels, honestly — what the catalog actually says, and where it's silent
Related: Find books at the right level
We measured our own catalog and found reading-level data sparse and noisy — most of the 521 "audience" field is movie ratings, not levels. RL1 is the honest extractor that answers anyway: it parses the real 526/521 fields, stores every value verbatim, throws out the movie ratings, and never guesses a number it can't read. Estimates and a unified browsable grade band come later, clearly flagged.
reading levels · Lexile · ATOS · Guided Reading · Fountas and Pinnell · MARC field · interest age · grade band · AV rating filter · demand-scoped · idempotency · SCD Type 0 · verbatim · honest data · chimpy-reader
chimpy-lake, by the book published · 2026-06-08
a library data platform, pattern by pattern through the data-engineering canon
A capstone tour of the platform itself — every moving part mapped to an established data-engineering pattern and shown running over a small, no-PII fixture.
ELT · schema-on-read · data contracts · medallion · tenants · dbt · DuckDB · idempotency · SCD Type 0 · data quality · observability · control plane
From a curated list to the books on our shelves published · 2026-06-08
matching curated reading lists against the CHPL catalog — with a human in the loop
Every reading list, hand-authored or pulled from an award, becomes a named, attributed, browsable entry showing how much of it we already own. Follow three real Newbery titles through the match: the machine is confident where identifiers align, and hands the uncertain ones to a librarian who confirms the right record. The honest payoff: of 177 award titles we own 169 (every Medal winner, 88.6% of Honor), and the rest is a tidy acquisition list.
reading lists · catalog matching · identifier normalization · ISBN · title-author key · human in the loop · adjudication · acquisition list · honest gaps · Datasette
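
The bitemporal bronze walkthrough describes a mechanism that is easier to see in code: diff each night against the prior state, store only the churn, stamp each row with the epoch it opened and closed, and rebuild any night on demand. The sketch below is an illustration under loud assumptions, not the sierra-snapshot engine. The names `Row`, `classify`, `Envelope`, `ingest`, and the fixture titles are hypothetical; only `as_of` echoes a name from the walkthrough, and its signature here is a guess. A plain Python list stands in for the real store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    key: str
    value: str
    opened: int                   # epoch at which this version appeared
    closed: Optional[int] = None  # epoch at which it was replaced or deleted


def classify(prior: dict, current: dict) -> dict:
    """Pure diff: label every key as new, changed, deleted, or unchanged."""
    labels = {}
    for key in prior.keys() | current.keys():
        if key not in prior:
            labels[key] = "new"
        elif key not in current:
            labels[key] = "deleted"
        elif prior[key] != current[key]:
            labels[key] = "changed"
        else:
            labels[key] = "unchanged"
    return labels


class Envelope:
    """Hypothetical in-memory stand-in for the stored base + churn."""

    def __init__(self) -> None:
        self.rows = []
        self.last_epoch = None

    def as_of(self, epoch: int) -> dict:
        """Rebuild the snapshot that was current at `epoch`."""
        return {
            r.key: r.value
            for r in self.rows
            if r.opened <= epoch and (r.closed is None or epoch < r.closed)
        }

    def ingest(self, snapshot: dict, epoch: int) -> dict:
        if self.last_epoch is not None:
            # Idempotent re-run: same epoch, same data, nothing written.
            if epoch == self.last_epoch and self.as_of(epoch) == snapshot:
                return classify(snapshot, snapshot)
            # Strictly forward epochs: anything else at or before the
            # last ingested epoch is refused.
            if epoch <= self.last_epoch:
                raise ValueError(f"epoch {epoch} is not after {self.last_epoch}")
        prior = self.as_of(self.last_epoch) if self.last_epoch is not None else {}
        labels = classify(prior, snapshot)
        # Close the open version of every row that changed or disappeared.
        for r in self.rows:
            if r.closed is None and labels.get(r.key) in ("changed", "deleted"):
                r.closed = epoch
        # Open a new version for every row that is new or changed.
        for key, label in labels.items():
            if label in ("new", "changed"):
                self.rows.append(Row(key, snapshot[key], opened=epoch))
        self.last_epoch = epoch
        return labels


# Three made-up nights: one edit and one addition, then one deletion.
night1 = {"b1": "Holes", "b2": "Hatchet"}
night2 = {"b1": "Holes", "b2": "Hatchet (rev.)", "b3": "Wonder"}
night3 = {"b2": "Hatchet (rev.)", "b3": "Wonder"}

env = Envelope()
for epoch, night in enumerate([night1, night2, night3], start=1):
    env.ingest(night, epoch)
```

In this toy run the three nights hold seven rows as full copies, but the envelope stores four: two base rows plus two opened by churn. The deletion on night three writes nothing new; it only closes an existing row.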

Index of keywords

abstain
Withholding an estimate when the model's runs disagree too much — 'don't know' is an honest answer. Find books at the right level
acquisition list
The titles we've confirmed the library does NOT own — a ready-made shopping list. From a curated list to the books on our shelves
adjudication
A librarian's explicit accept of one match candidate the machine couldn't confidently resolve. From a curated list to the books on our shelves
AI estimate
A grade band a local model infers when no catalogued measure exists — always marked visually distinct. Find books at the right level
ATOS
A grade-equivalent readability formula (e.g. AR 4.6) used by Accelerated Reader. Find books at the right level, What grade is this book?
AV rating filter
The step that discards AV/MPAA ratings from MARC 521 before any reading-level data is stored. What grade is this book?
bitemporal
Each row carries two clocks — when the fact was true and when we recorded it — so any past state rebuilds exactly. What did the catalog say last Tuesday?
bronze
The raw, append-only landing layer of the medallion — data as sourced, never edited in place. What did the catalog say last Tuesday?
browse by level
Filtering a list to show only titles within a chosen grade range. Find books at the right level
byte-for-byte
Two copies are identical at the binary level — no re-encoding, reordering, or silent loss. Did we keep your catalog intact?
catalog
The library's full set of bibliographic records describing every title held. Did we keep your catalog intact?
catalog matching
Linking a list entry to the library's catalog (bib) record(s). From a curated list to the books on our shelves
chimpy-lake
The library's internal data-lake platform — a medallion store on DuckDB/DuckLake fed by Sierra, OverDrive, and circulation. What did the catalog say last Tuesday?, Did we keep your catalog intact?
chimpy-reader
The library's public reading-list browser, built on Datasette over the chimpy-lake pipeline. Find books at the right level, What grade is this book?
churn
The rows that actually changed between two snapshots — what we store instead of a full copy. What did the catalog say last Tuesday?
conformance
A tenant passes the platform's contract test suite (manifest + SDK + Quadlet + smoke). Scaffolding a tenant: from cookiecutter to conformance
control plane
One place to see and operate every pipeline's status across the fleet. One lake, many tenants: the control plane, chimpy-lake, by the book
cookiecutter
A tool that generates a project from a templated directory and a set of answers. Scaffolding a tenant: from cookiecutter to conformance
Further reading: cookiecutter docs ↗
data contracts
A declared, validated shape a data source must satisfy before it can join the lake. One lake, many tenants: the control plane, chimpy-lake, by the book
data quality
Automated assertions on every load — row counts, nulls, uniqueness, schema drift. Did we keep your catalog intact?, chimpy-lake, by the book
Datasette
A tool that publishes a database as a browsable, queryable website. Find books at the right level, From a curated list to the books on our shelves
Further reading: Datasette ↗
dbt
A tool that builds SQL transformations as a tested, ordered dependency graph. chimpy-lake, by the book
demand-scoped
Processing only the titles a curated list references, not the whole catalog. What grade is this book?
desired-state reconciliation
Declare the target state and let the tool make the host match it — the same model as a Kubernetes controller or kubectl apply. One lake, many tenants: the control plane
Further reading: Wikipedia: Kubernetes ↗
diff
Comparing tonight's snapshot to the prior state row-by-row — new, changed, deleted, or unchanged. What did the catalog say last Tuesday?
DuckDB
An in-process analytical SQL database — like SQLite, but built for analytics. What did the catalog say last Tuesday?, chimpy-lake, by the book
ELT
Load the raw data first, transform it later — vs. ETL, which transforms before loading. chimpy-lake, by the book
epoch
An integer stamped on each nightly ingest run, used to open/close rows and order history. What did the catalog say last Tuesday?
faceted browse
Narrowing a list by picking a category from a sidebar — here, a grade band. Find books at the right level
Fountas and Pinnell
A leveled-reading system assigning books letters A–Z+ by text characteristics. Find books at the right level, What grade is this book?
grade band
A coarse range (K–2, 3–5, 6–8) that collapses competing reading scales into one browsable label. Find books at the right level, What grade is this book?
Guided Reading
Small-group reading instruction, and the A–Z leveling it popularized. What grade is this book?
honest data
Showing estimates and gaps visibly — including what we don't know — rather than hiding them. Find books at the right level, What grade is this book?
honest gaps
Naming the titles we could NOT confidently match, rather than hiding them. From a curated list to the books on our shelves
human in the loop
A design pattern where a person reviews and confirms uncertain machine output before it's treated as final. From a curated list to the books on our shelves
idempotency
Running a load twice has the same effect as running it once — retries can't corrupt. What did the catalog say last Tuesday?, One lake, many tenants: the control plane, What grade is this book?, chimpy-lake, by the book
identifier normalization
Cleaning and cross-deriving IDs — e.g. an ISBN-10 implies its ISBN-13 twin. From a curated list to the books on our shelves
Further reading: Wikipedia: ISBN ↗
ingest invariant
A rule checked automatically on every ingest run, so violations surface at load time. Did we keep your catalog intact?
instance registry
The per-host list of which tenants are deployed there; the inventory the control plane acts on. One lake, many tenants: the control plane
interest age
An age range (e.g. 8–12) for the intended reader, drawn from catalog notes — not a grade. What grade is this book?
ISBN
The standard book identifier; every ISBN-10 converts to a 978-prefixed ISBN-13, and 978-prefixed ISBN-13s convert back. From a curated list to the books on our shelves
Further reading: Wikipedia: ISBN ↗
ISO-2709
The ISO standard for the binary wire format MARC records travel in. Did we keep your catalog intact?
Further reading: Wikipedia: ISO 2709 ↗
Lexile
A numeric reading-difficulty score for books and readers, from MetaMetrics (e.g. 850L). Find books at the right level, What grade is this book?
Further reading: Wikipedia: Lexile ↗
lifecycle
The uniform verb set (run / status / dry-run / migrate) every onboarded tenant answers, so the operator never learns a per-tenant CLI. One lake, many tenants: the control plane
local model
Running AI on library-owned hardware so patron-adjacent text never leaves the network. Find books at the right level
lockstep
Template, contract, and version kept equal by an enforced CI guard — drift can't merge. Scaffolding a tenant: from cookiecutter to conformance
manifest
A file in which a tenant declares what it is (name, schema, schedule), validated before it can join the lake. Scaffolding a tenant: from cookiecutter to conformance, One lake, many tenants: the control plane
MARC
The dominant library record format — a binary envelope of bibliographic fields (245 title, 020 ISBN…). Did we keep your catalog intact?
MARC field
A named slot in a MARC record (e.g. 245 title, 526 reading program, 521 target audience). What grade is this book?
medallion
Layered refinement: raw (bronze) → cleaned & typed (silver) → query-ready (gold). chimpy-lake, by the book
monotonic epochs
The rule that each ingest epoch must move strictly forward — out-of-order runs are refused. What did the catalog say last Tuesday?
native-first consensus
Let a catalogued grade win over a converted or estimated one; real disagreement widens the band, never invents a midpoint. Find books at the right level
never-fatal
Per-record errors are caught, counted, and named — never raised to abort the run or hidden by a skip. Did we keep your catalog intact?
Newbery
An annual award for the year's most distinguished American children's book. Find books at the right level
observability
The platform records every run and computes its own health from that ledger. chimpy-lake, by the book
parse fidelity
A standing guarantee that every stored record is byte-for-byte what the source sent. Did we keep your catalog intact?
plugin contract
The rules a tenant container must satisfy to run inside an instance (manifest + lifecycle + telemetry). Scaffolding a tenant: from cookiecutter to conformance
provenance
A record of where an artifact came from — here, which chimpy-lake version scaffolded a tenant. Scaffolding a tenant: from cookiecutter to conformance
Quadlet
A systemd-native unit file that describes a Podman container, so containers run as ordinary systemd services. One lake, many tenants: the control plane
Further reading: Podman: Quadlet ↗
reading levels
Standardized measures of how hard a book is to read, used to match books to readers. Find books at the right level, What grade is this book?
reading lists
Curated lists of titles (e.g. award winners) checked against what the library holds. From a curated list to the books on our shelves
run-now
Fire a tenant on demand via the same code path a timer fire uses — no parallel manual-run logic to drift. One lake, many tenants: the control plane
scaffold
Generate a correct project skeleton from a template + one command (here, a tenant). Scaffolding a tenant: from cookiecutter to conformance
SCD Type 0
A landed snapshot is never edited in place; history is append-only. What grade is this book?, chimpy-lake, by the book
SCD Type 2
Preserve history by adding a new row with open/close dates, instead of overwriting the old one. What did the catalog say last Tuesday?
schema-on-read
Store data exactly as it arrives; impose structure only when you query it. chimpy-lake, by the book
self-consistency
Run the model several times and only accept an answer when the runs agree. Find books at the right level
sierra-snapshot
The tenant that ingests the Sierra ILS catalog nightly and stores it as bitemporal bronze. What did the catalog say last Tuesday?
snapshot
A full read of a source system at a point in time, kept as an immutable record. What did the catalog say last Tuesday?
tenants
Per-source extract pipelines that feed the lake; each tenant owns its own ingest logic and landing zone. What did the catalog say last Tuesday?, Scaffolding a tenant: from cookiecutter to conformance, One lake, many tenants: the control plane, Did we keep your catalog intact?, chimpy-lake, by the book
tiers
How deeply a tenant plugs in — Tier-2 = a validated data contract; Tier-3 = contract plus a lifecycle the control plane can operate. One lake, many tenants: the control plane
time-travel
Reconstructing what the data looked like at any past date, without keeping a full copy of every night. What did the catalog say last Tuesday?
title-author key
A synthesized match key for records that arrive with no ISBN to match on. From a curated list to the books on our shelves
transparency
Linking every AI estimate back to the raw model output that produced it. Find books at the right level
verbatim
Raw catalog text stored exactly as found, beside any parsed interpretation — never normalized away. What grade is this book?
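
The identifier normalization entry above says an ISBN-10 implies its ISBN-13 twin. The standard conversion is short enough to show: prefix 978, drop the old check digit, and recompute the check digit with alternating weights of 1 and 3. This is a minimal sketch of that public algorithm, not the platform's matcher; the function name `isbn10_to_isbn13` is hypothetical.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Derive the 978-prefixed ISBN-13 for a valid ISBN-10."""
    digits = isbn10.replace("-", "").replace(" ", "").upper()
    body, check = digits[:9], digits[9:]
    if len(digits) != 10 or not (body.isascii() and body.isdigit()):
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    if check not in "0123456789X":
        raise ValueError(f"bad ISBN-10 check character: {isbn10!r}")

    # Validate the ISBN-10 check digit: weights 10..1, total divisible by 11.
    # A final "X" stands for the value 10.
    values = [int(c) for c in body] + [10 if check == "X" else int(check)]
    if sum((10 - i) * v for i, v in enumerate(values)) % 11 != 0:
        raise ValueError(f"ISBN-10 check digit does not match: {isbn10!r}")

    # Build the ISBN-13: prefix, first nine digits, new check digit
    # computed with alternating weights 1, 3, 1, 3, ...
    stem = "978" + body
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(stem))
    return stem + str((10 - total % 10) % 10)
```

Validating the ISBN-10 before converting matters for matching: a mistyped identifier would otherwise yield a well-formed ISBN-13 that belongs to no book.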

Talks & related

Building a Data Lake for Your Library · talk
IUG 2026 conference talk — early experiments at CHPL with library-owned data infrastructure.
Our Data, Our Lake · zine
A printable zine on owning the library's data — the lake, in plain language.