Narrated, reproducible walkthroughs of working code from the chimpy-lake data platform. Each one shows real commands and their captured output; each is a living artifact — always the current render of current code.
bitemporal bronze — keep every nightly snapshot as churn, reconstruct any past state exactly
The sierra-snapshot tenant keeps a bitemporal bronze: every nightly Sierra snapshot is set-diffed against the prior state and stored as just the churn — only what moved — each row stamped with the epoch it opened and closed. This runs the real engine: the pure diff classifier, three nights ingested into the envelope (stored as base + churn, not full copies), an as_of() time-travel that reconstructs each night exactly, and the safety rails — idempotent re-runs and strictly forward epochs — that make the timeline impossible to corrupt.
bitemporal, time-travel, snapshot, sierra-snapshot, bronze, SCD Type 2, epoch, diff, churn, idempotency, monotonic epochs, DuckDB, tenants, chimpy-lake
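The churn-plus-epochs idea behind the bitemporal bronze can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the tenant's actual engine: the names `Row`, `Envelope`, `classify`, and `ingest` are hypothetical, snapshots are modelled as plain `key -> value` dicts, and only `as_of()` is a name taken from the description above. Treating a repeat of the latest epoch as a no-op is one assumed reading of "idempotent re-runs".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    key: str
    value: str
    opened: int            # epoch at which this version first appeared
    closed: Optional[int]  # epoch at which it was superseded; None = still current


def classify(prior: dict, snapshot: dict):
    """Pure set-diff: which keys were added, changed, or removed."""
    added = {k for k in snapshot if k not in prior}
    removed = {k for k in prior if k not in snapshot}
    changed = {k for k in snapshot if k in prior and snapshot[k] != prior[k]}
    return added, changed, removed


class Envelope:
    """Hypothetical store: base rows plus churn, never full copies."""

    def __init__(self):
        self.rows = []
        self.last_epoch = None

    def current(self) -> dict:
        return {r.key: r.value for r in self.rows if r.closed is None}

    def ingest(self, epoch: int, snapshot: dict) -> int:
        """Store only what moved; return the number of churned keys."""
        if self.last_epoch is not None:
            if epoch == self.last_epoch:
                return 0  # assumed idempotency rule: re-running a night is a no-op
            if epoch < self.last_epoch:
                raise ValueError("epochs must move strictly forward")
        added, changed, removed = classify(self.current(), snapshot)
        for r in self.rows:
            if r.closed is None and r.key in (changed | removed):
                r.closed = epoch  # close the superseded version
        for k in sorted(added | changed):
            self.rows.append(Row(k, snapshot[k], epoch, None))
        self.last_epoch = epoch
        return len(added) + len(changed) + len(removed)

    def as_of(self, epoch: int) -> dict:
        """Rebuild the state exactly as it stood at the given epoch."""
        return {
            r.key: r.value
            for r in self.rows
            if r.opened <= epoch and (r.closed is None or epoch < r.closed)
        }
```

Ingesting three nights and calling `as_of()` for each epoch should return each night's snapshot unchanged, while the envelope holds only one row per version of a key rather than three full copies.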
Drive the real cookiecutter to mint a tenant, validate its Tier-3 manifest + version provenance, and run the live conformance gate green — onboarding as generation plus a gate.
The platform claim made concrete: each data source is a tenant that declares itself in a manifest, inherits a uniform lifecycle, and is installed + run from one control-plane CLI — idempotently and safely, over a small no-PII fixture.
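One way to picture an install that is idempotent and safe is desired-state reconciliation: compute the difference between what is declared and what is installed, then apply only that difference. The sketch below is an assumption-laden illustration, not the control plane's real code. The functions `plan` and `apply`, the `name -> version` dict shape, and the `install` / `migrate` action labels are all hypothetical, and tenants absent from the declared state are simply left alone.

```python
def plan(desired: dict, installed: dict) -> list:
    """Diff declared tenants (name -> version) against the installed ones."""
    actions = []
    for name, version in sorted(desired.items()):
        if name not in installed:
            actions.append(("install", name, version))
        elif installed[name] != version:
            actions.append(("migrate", name, version))
    return actions


def apply(desired: dict, installed: dict, dry_run: bool = False) -> list:
    """Make `installed` match `desired`; with dry_run, only report the plan."""
    actions = plan(desired, installed)
    if not dry_run:
        for _verb, name, version in actions:
            installed[name] = version
    return actions
```

Because the plan is derived from the current state each time, running `apply` a second time with the same inputs yields an empty action list, which is what makes the operation safe to repeat.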
The companion to "What grade is this book?": where that one extracted reading levels honestly, this one is the patron payoff — turning scattered, multi-scale measures (Lexile / ATOS / Fountas & Pinnell) into one browsable grade band per book. A real catalogued grade shows solid; a converted or AI-estimated band shows faint with an ⓘ that links to how it was made. AI estimates are included in the browse facet (marked), so the surface stays useful without ever passing an estimate off as the catalog's word.
reading levels, grade band, browse by level, Lexile, ATOS, Fountas and Pinnell, native-first consensus, AI estimate, self-consistency, local model, abstain, transparency, faceted browse, Datasette, honest data, chimpy-reader, Newbery
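The solid-versus-faint rule and the idea of abstaining can be sketched as follows. This is a hedged illustration under stated assumptions, not the chimpy-reader implementation: `Measure`, `Band`, `consensus`, and `estimate_or_abstain` are hypothetical names, grades are modelled as integers, and the `max_spread` threshold of one grade is an arbitrary choice for the example.

```python
from typing import NamedTuple, Optional


class Measure(NamedTuple):
    low: int
    high: int
    provenance: str  # "catalogued", "converted", or "estimated"


class Band(NamedTuple):
    low: int
    high: int
    solid: bool  # True only when the band rests on catalogued data


def consensus(measures) -> Optional[Band]:
    """Catalogued measures win; disagreement widens the band, never averages it."""
    measures = list(measures)
    if not measures:
        return None
    native = [m for m in measures if m.provenance == "catalogued"]
    pool = native or measures
    return Band(min(m.low for m in pool), max(m.high for m in pool), bool(native))


def estimate_or_abstain(runs, max_spread: int = 1) -> Optional[Measure]:
    """Turn repeated model runs into an estimate, or None if they disagree too much."""
    runs = list(runs)
    if not runs or max(runs) - min(runs) > max_spread:
        return None  # abstain: "don't know" is an acceptable answer
    return Measure(min(runs), max(runs), "estimated")
```

Keeping `solid` on the result lets the browse surface draw catalogued bands differently from derived ones without a second lookup.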
A data lake is only trustworthy if its copy of a record is the record. This runs the platform's standing parse-fidelity check for real: it reads each record's field tags straight from the raw ISO-2709 bytes and proves they match the stored marc_json, catches a dropped field by naming the exact bib, is honest about which of its four facets actually proves fidelity, and treats a corrupt record as a counted failure — never a crash that could hide it.
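The tag comparison at the heart of that check can be sketched against the ISO 2709 layout: a 24-byte leader, a directory of 12-byte entries (3-byte tag, 4-byte length, 5-byte offset, as in MARC 21), and a field terminator closing the directory. The sketch below is illustrative only. The functions `directory_tags` and `check`, the report shape, and the assumption that `marc_json` follows the MARC-in-JSON form `{"fields": [{"245": ...}, ...]}` are not confirmed by the source, and it covers only the tag comparison, not the check's other facets.

```python
from collections import Counter

FIELD_TERMINATOR = b"\x1e"


def directory_tags(raw: bytes) -> list:
    """Read field tags straight from the ISO 2709 directory."""
    if len(raw) < 25:
        raise ValueError("record is shorter than a leader")
    base = int(raw[12:17].decode("ascii"))  # leader 12-16: base address of data
    directory = raw[24:base - 1]
    if base > len(raw) or raw[base - 1:base] != FIELD_TERMINATOR or len(directory) % 12:
        raise ValueError("malformed directory")
    return [directory[i:i + 3].decode("ascii") for i in range(0, len(directory), 12)]


def check(records) -> dict:
    """records: iterable of (bib_id, raw_bytes, marc_json). Never raises per record."""
    report = {"checked": 0, "corrupt": [], "dropped": {}}
    for bib_id, raw, marc_json in records:
        report["checked"] += 1
        try:
            raw_tags = Counter(directory_tags(raw))
        except ValueError:
            report["corrupt"].append(bib_id)  # counted and named, not a crash
            continue
        stored = Counter(next(iter(field)) for field in marc_json["fields"])
        missing = raw_tags - stored
        if missing:
            report["dropped"][bib_id] = sorted(missing.elements())
    return report
```

Comparing with `Counter` rather than `set` means a record that lost one of two repeated fields is still reported.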
We measured our own catalog and found reading-level data sparse and noisy — most of the 521 "audience" field is movie ratings, not levels. RL1 is the honest extractor that answers anyway: it parses the real 526/521 fields, stores every value verbatim, throws out the movie ratings, and never guesses a number it can't read. Estimates and a unified browsable grade band come later, clearly flagged.
reading levels, Lexile, ATOS, Guided Reading, Fountas and Pinnell, MARC field, interest age, grade band, AV rating filter, demand-scoped, idempotency, SCD Type 0, verbatim, honest data, chimpy-reader
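The extractor's three rules (keep values verbatim, discard movie ratings, never guess a number) can be sketched for 521-style notes as below. This is a simplified illustration, not RL1 itself: `extract`, `AV_RATING`, and `LEXILE` are hypothetical names, the rating pattern is a small assumed sample rather than the platform's real filter, and 526 handling is left out.

```python
import re

# Assumption: a small illustrative set of MPAA-style ratings, not an exhaustive list.
AV_RATING = re.compile(
    r"^\s*(?:MPAA rating:\s*|Rated\s+)?(?:G|PG|PG-13|R|NC-17|Not rated|Unrated)\b",
    re.IGNORECASE,
)

# A Lexile measure is digits followed by "L", optionally after a two-letter code (AD520L).
LEXILE = re.compile(r"(?<!\d)(\d{1,4})L\b")


def extract(notes) -> list:
    """Keep each non-rating note verbatim; attach a Lexile only when one is readable."""
    kept = []
    for raw in notes:
        if AV_RATING.match(raw):
            continue  # a movie rating, not a reading level
        found = LEXILE.search(raw)
        kept.append({
            "verbatim": raw,
            "lexile": int(found.group(1)) if found else None,  # None, never a guess
        })
    return kept
```

A note such as "Ages 8-12" is kept with `lexile` set to `None`: the value survives verbatim for later stages, but no number is invented for it.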
a library data platform, pattern by pattern through the data-engineering canon
A capstone tour of the platform itself — every moving part mapped to an established data-engineering pattern and shown running over a small, no-PII fixture.
ELT, schema-on-read, data contracts, medallion, tenants, dbt, DuckDB, idempotency, SCD Type 0, data quality, observability, control plane
matching curated reading lists against the CHPL catalog — with a human in the loop
Every reading list, hand-authored or pulled from an award, becomes a named, attributed, browsable entry showing how much of it we already own. Follow three real Newbery titles through the match: the machine is confident where identifiers align, and hands the uncertain ones to a librarian who confirms the right record. The honest payoff: of 177 award titles we own 169 (every Medal winner, 88.6% of Honor), and the rest is a tidy acquisition list.
reading lists, catalog matching, identifier normalization, ISBN, title-author key, human in the loop, adjudication, acquisition list, honest gaps, Datasette
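The three outcomes of the match (confident, handed to a librarian, or a gap for the acquisition list) can be sketched like this. It is a minimal illustration under assumptions, not the platform's matcher: `isbn13`, `title_author_key`, and `match` are hypothetical names, records are plain dicts, authors are assumed to be in "Surname, Given" form, and ISBN check digits are recomputed on conversion but not validated on input.

```python
import re


def isbn13(raw: str):
    """Normalize an ISBN-10 or ISBN-13 string to bare 13 digits, or None."""
    s = re.sub(r"[^0-9Xx]", "", raw).upper()
    if len(s) == 13 and s.isdigit():
        return s
    if len(s) == 10 and s[:9].isdigit():
        core = "978" + s[:9]
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None


def title_author_key(title: str, author: str) -> str:
    """Fallback key: normalized title plus the author's surname."""
    def norm(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return f"{norm(title)}|{norm(author.split(',')[0])}"


def match(entry: dict, catalog: list):
    """Return ("confident", bib), ("needs_review", [bibs]), or ("gap", None)."""
    want = isbn13(entry.get("isbn") or "")
    if want:
        for rec in catalog:
            if isbn13(rec.get("isbn") or "") == want:
                return ("confident", rec["bib"])
    key = title_author_key(entry["title"], entry["author"])
    candidates = [
        rec["bib"] for rec in catalog
        if title_author_key(rec["title"], rec["author"]) == key
    ]
    if candidates:
        return ("needs_review", candidates)  # a librarian confirms the right record
    return ("gap", None)  # goes on the acquisition list
```

Only an identifier match is treated as confident here; a title-author hit is deliberately returned as a list of candidates so that a person, not the code, picks the record.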
Index of keywords
abstain
Withholding an estimate when the model's runs disagree too much — 'don't know' is an honest answer. → Find books at the right level
AV rating filter
The step that discards AV/MPAA ratings from MARC 521 before any reading-level data is stored. → What grade is this book?
bitemporal
Each row carries two clocks — when the fact was true and when we recorded it — so any past state rebuilds exactly. → What did the catalog say last Tuesday?
demand-scoped
Processing only the titles a curated list references, not the whole catalog. → What grade is this book?
desired-state reconciliation
Declare the target state and let the tool make the host match it — the same model as a Kubernetes controller or kubectl apply. → One lake, many tenants: the control plane
The uniform verb set (run / status / dry-run / migrate) every onboarded tenant answers, so the operator never learns a per-tenant CLI. → One lake, many tenants: the control plane
local model
Running AI on library-owned hardware so patron-adjacent text never leaves the network. → Find books at the right level
native-first consensus
Let a catalogued grade win over a converted or estimated one; real disagreement widens the band, never invents a midpoint. → Find books at the right level
never-fatal
Per-record errors are caught, counted, and named — never raised to abort the run or hidden by a skip. → Did we keep your catalog intact?
How deeply a tenant plugs in — Tier-2 = a validated data contract; Tier-3 = contract plus a lifecycle the control plane can operate. → One lake, many tenants: the control plane