chimpy-lake showboats

Narrated, reproducible walkthroughs of working code from the chimpy-lake data platform. Each one shows real commands and their captured output; each is a living artifact — always the current render of current code.

Walkthroughs

Declare to publish (Simple English) published · 2026-07-20

the open-tier PII rule, in Simple English Wikipedia style — short sentences, every hard word explained

Related: Declare to publish

A Simple English retelling of the 'declare to publish' walkthrough: the library keeps private reader facts out of open reports because a report may show a column only if a person wrote it on a list first. Same real, reproducible steps; plain words.

Simple English · default-deny · PII governance · allowlist · plain language · accessibility · sensitivity tiers · dry run

Declare to publish published · 2026-07-20

how the open mart tier proves it carries no PII — by refusing what a manifest didn't declare

Follows one mart through the open-tier governance choke point — where publishing is default-deny by declaration, not by scanning — then to the merged-but-gated machinery that will carry it to the ils-reports host, and an honest map of what's left.

default-deny · PII governance · data contracts · allowlist · sensitivity tiers · fail-open · fail-closed · Datasette · dry run · mart tiering
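The default-deny rule in this walkthrough can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the platform's code: the function name `publishable_columns` and the column names are hypothetical, the real manifest format is not shown here, and refusing the whole publish (rather than dropping only the offending column) is an assumed behaviour.

```python
def publishable_columns(mart_columns, declared):
    """Default-deny by declaration: a column passes only if someone listed it.

    Nothing is scanned for PII. Assumption: one undeclared column refuses the
    whole publish.
    """
    undeclared = [col for col in mart_columns if col not in declared]
    if undeclared:
        # Fail closed: name what was refused instead of silently dropping it.
        raise PermissionError(f"undeclared columns refused: {undeclared}")
    return list(mart_columns)


# Hypothetical declared allowlist for one open-tier mart.
declared = {"branch", "month", "checkouts"}

print(publishable_columns(["branch", "month", "checkouts"], declared))
```

The design point is that a new column added upstream is refused until a person declares it, so the gate never depends on a scanner recognising what private data looks like.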

The consistency report published · 2026-07-07

making reference-data drift visible — never fatal, never silent

A tour of the reference data-consistency report: a declarative check registry + a never-fatal runner that flags location/code drift in the published reference dims, writes findings to a queryable table, and reports severity to the telemetry hub — with two honest failure domains (fail-open per check, fail-loud on a broken substrate).

data quality · data consistency · reference dimensions · single source of truth · declarative checks · fail-open · failure domains · never-fatal · observability · append-only · telemetry envelope · DuckDB
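The two failure domains named above can be sketched as a small registry plus runner. Everything here is an illustrative assumption: `Check`, `SubstrateError`, `run_checks`, the dim names, and the finding fields are hypothetical, and the real runner writes to a table and a telemetry hub rather than returning a list.

```python
from dataclasses import dataclass
from typing import Callable


class SubstrateError(RuntimeError):
    """The data the checks read from is missing or unreadable."""


@dataclass(frozen=True)
class Check:
    name: str
    severity: str
    run: Callable[[dict], list]  # returns human-readable drift details


def run_checks(checks, dims):
    # Fail loud: with no reference dims, an empty report would be a false all-clear.
    if not dims:
        raise SubstrateError("reference dims unavailable")
    findings = []
    for check in checks:
        try:
            details = check.run(dims)
        except Exception as exc:
            # Fail open per check: one broken check becomes a finding, not a crash.
            findings.append({"check": check.name, "severity": "error",
                             "detail": f"check did not complete: {exc!r}"})
            continue
        findings.extend({"check": check.name, "severity": check.severity, "detail": d}
                        for d in details)
    return findings


def unlabeled_locations(dims):
    missing = sorted(dims["item_locations"] - dims["location_labels"])
    return [f"location code {code!r} has no label" for code in missing]


# The registry is data: adding a check means adding a row, not editing the runner.
REGISTRY = [
    Check("unlabeled_locations", "warn", unlabeled_locations),
    Check("broken_check", "warn", lambda dims: dims["no_such_dim"]),
]

dims = {"item_locations": {"ma", "ch", "zz"}, "location_labels": {"ma", "ch"}}
for finding in run_checks(REGISTRY, dims):
    print(finding)
```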

Flat reference dimensions — one manifest, generated dims published · 2026-07-06

turning Sierra codes into staff-facing labels, without hand-copying the machinery per dim

Related: chimpy-lake, by the book

How we industrialised Sierra's flat code->label dictionaries: one versioned manifest is the single source of truth, a generator emits the committed dbt dims from it, the extract tenant harvests bronze from the same manifest, and a freshness guard fails the build on any drift. Runs the real manifest, generator, and macro over a frozen no-PII fixture.

code->label dimension · single source of truth · DRY · code generation · medallion · schema-on-read · SCD Type 0 · SCD Type 1 · conformed dimension · freshness guard · data contracts · data quality · dbt · DuckDB · provenance · grain
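The generator-plus-guard idea can be sketched as follows. This is a simplified illustration, not the project's generator: `render_dim`, `stale_dims`, the manifest shape, and the emitted SQL text are all assumptions, and the '-' row is an invented example entry.

```python
def sql_quote(text):
    # Double any single quote so a label cannot break the generated literal.
    return "'" + text.replace("'", "''") + "'"


def render_dim(name, codes):
    """Render one code->label dim as SQL text from its manifest entry."""
    rows = ",\n".join(f"    ({sql_quote(code)}, {sql_quote(label)})"
                      for code, label in sorted(codes.items()))
    return ("-- generated from the manifest; do not edit by hand\n"
            f"select * from (values\n{rows}\n) as {name}(code, label)\n")


def stale_dims(manifest, committed):
    """Freshness guard: dims whose committed text differs from a fresh render."""
    return sorted(name for name, codes in manifest.items()
                  if committed.get(name) != render_dim(name, codes))


# Hypothetical manifest: dim name -> {code: label}.
manifest = {"item_status": {"!": "ON HOLDSHELF", "-": "AVAILABLE"}}
committed = {name: render_dim(name, codes) for name, codes in manifest.items()}
print(stale_dims(manifest, committed))   # nothing stale right after generating

manifest["item_status"]["m"] = "MISSING"  # manifest edited, dims not regenerated
print(stale_dims(manifest, committed))   # the guard now names the drifted dim
```

In a build, a non-empty result from `stale_dims` would be the point at which to exit non-zero, so a hand-edited or out-of-date dim cannot ship.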

What did the catalog say last Tuesday? published · 2026-06-18

bitemporal bronze — keep every nightly snapshot as churn, reconstruct any past state exactly

The sierra-snapshot tenant keeps a bitemporal bronze: every nightly Sierra snapshot is set-diffed against the prior state and stored as just the churn — only what moved — each row stamped with the epoch it opened and closed. This runs the real engine: the pure diff classifier, three nights ingested into the envelope (stored as base + churn, not full copies), an as_of() time-travel that reconstructs each night exactly, and the safety rails — idempotent re-runs and strictly forward epochs — that make the timeline impossible to corrupt.

bitemporal · time-travel · snapshot · sierra-snapshot · bronze · SCD Type 2 · epoch · diff · churn · idempotency · monotonic epochs · DuckDB · tenants · chimpy-lake
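The base-plus-churn storage and the as_of() reconstruction can be sketched in memory. This is a toy model under stated assumptions, not the sierra-snapshot engine: the `Timeline` class and its row layout are hypothetical, it tracks a single epoch axis only, and treating a repeat of the latest epoch as a no-op is an assumed reading of "idempotent re-runs".

```python
class Timeline:
    """Stores versions as [opened, closed) epoch intervals, not full nightly copies."""

    def __init__(self):
        self.rows = []          # dicts: key, value, opened, closed (None = still open)
        self.last_epoch = None

    def ingest(self, snapshot, epoch):
        if self.last_epoch is not None:
            if epoch == self.last_epoch:
                return          # assumed: re-running the same night changes nothing
            if epoch < self.last_epoch:
                raise ValueError(f"epoch {epoch} is not after {self.last_epoch}")
        open_rows = {r["key"]: r for r in self.rows if r["closed"] is None}
        # Close every open version that changed or disappeared.
        for key, row in open_rows.items():
            if key not in snapshot or snapshot[key] != row["value"]:
                row["closed"] = epoch
        # Open a version for every key that is new or was just closed by a change.
        for key, value in snapshot.items():
            prior = open_rows.get(key)
            if prior is None or prior["closed"] == epoch:
                self.rows.append({"key": key, "value": value,
                                  "opened": epoch, "closed": None})
        self.last_epoch = epoch

    def as_of(self, epoch):
        return {r["key"]: r["value"] for r in self.rows
                if r["opened"] <= epoch and (r["closed"] is None or epoch < r["closed"])}


nights = [{"a": 1, "b": 2}, {"a": 1, "b": 3, "c": 4}, {"a": 1, "c": 4}]
timeline = Timeline()
for epoch, snapshot in enumerate(nights, start=1):
    timeline.ingest(snapshot, epoch)

print(len(timeline.rows), "stored rows for", sum(len(n) for n in nights), "snapshot rows")
print(all(timeline.as_of(e) == n for e, n in enumerate(nights, start=1)))
```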

Scaffolding a tenant: from cookiecutter to conformance published · 2026-06-18

one command turns nothing into a tenant the platform already trusts

Drive the real cookiecutter to mint a tenant, validate its Tier-3 manifest + version provenance, and run the live conformance gate green — onboarding as generation plus a gate.

scaffold · cookiecutter · tenants · manifest · provenance · conformance · lockstep · plugin contract

One lake, many tenants: the control plane published · 2026-06-18

how a tenant declares itself, inherits a lifecycle, and is driven by one operator CLI

Related: chimpy-lake, by the book

The platform claim made concrete: each data source is a tenant that declares itself in a manifest, inherits a uniform lifecycle, and is installed + run from one control-plane CLI — idempotently and safely, over a small no-PII fixture.

tenants · lifecycle · control plane · manifest · data contracts · tiers · instance registry · idempotency · desired-state reconciliation · Quadlet · run-now

Find books at the right level published · 2026-06-18

the payoff — browse a curated reading list by grade band, estimates marked and inspectable

Related: What grade is this book?

The companion to "What grade is this book?": where that one extracted reading levels honestly, this one is the patron payoff — turning scattered, multi-scale measures (Lexile / ATOS / Fountas & Pinnell) into one browsable grade band per book. A real catalogued grade shows solid; a converted or AI-estimated band shows faint with an ⓘ that links to how it was made. AI estimates are included in the browse facet (marked), so the surface stays useful without ever passing an estimate off as the catalog's word.

reading levels · grade band · browse by level · Lexile · ATOS · Fountas and Pinnell · native-first consensus · AI estimate · self-consistency · local model · abstain · transparency · faceted browse · Datasette · honest data · chimpy-reader · Newbery

Did we keep your catalog intact? published · 2026-06-18

the parse-fidelity invariant — proving, byte-for-byte, that the lake's copy IS the record

Related: chimpy-lake, by the book

A data lake is only trustworthy if its copy of a record is the record. This runs the platform's standing parse-fidelity check for real: it reads each record's field tags straight from the raw ISO-2709 bytes and proves they match the stored marc_json, catches a dropped field by naming the exact bib, is honest about which of its four facets actually proves fidelity, and treats a corrupt record as a counted failure — never a crash that could hide it.

parse fidelity · MARC · ISO-2709 · catalog · data quality · never-fatal · byte-for-byte · ingest invariant · tenants · chimpy-lake
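One facet of such a check, comparing the field tags in the raw ISO-2709 directory with the tags in the stored JSON, can be sketched as below. The leader and directory layout follows the ISO 2709 / MARC 21 record structure; everything else is an assumption for illustration: the function names, the report shape, the bib ids, and the guess that marc_json uses the MARC-in-JSON layout (a "fields" list of single-key objects). `build_record` is only a fixture helper.

```python
from collections import Counter

LEADER_LEN = 24
ENTRY_LEN = 12  # 3-char tag + 4-digit field length + 5-digit start offset


def directory_tags(raw):
    """Read field tags from the directory without decoding any field data."""
    base = int(raw[12:17].decode("ascii"))   # leader bytes 12-16: base address of data
    directory = raw[LEADER_LEN:base - 1]     # the directory ends with one 0x1E byte
    return [directory[i:i + 3].decode("ascii")
            for i in range(0, len(directory), ENTRY_LEN)]


def dropped_tags(raw, marc_json):
    stored = Counter(tag for field in marc_json["fields"] for tag in field)
    return sorted((Counter(directory_tags(raw)) - stored).elements())


def check_batch(records):
    """records: (bib_id, raw_bytes, marc_json) triples. A bad record is counted, not raised."""
    report = {"ok": 0, "dropped": {}, "corrupt": []}
    for bib_id, raw, marc_json in records:
        try:
            missing = dropped_tags(raw, marc_json)
        except (ValueError, KeyError, TypeError):
            report["corrupt"].append(bib_id)
            continue
        if missing:
            report["dropped"][bib_id] = missing
        else:
            report["ok"] += 1
    return report


def build_record(fields):
    """Fixture helper: assemble a minimal record from (tag, data_bytes) pairs."""
    directory, body = b"", b""
    for tag, data in fields:
        data += b"\x1e"
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    base = LEADER_LEN + len(directory) + 1
    leader = f"{base + len(body) + 1:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + b"\x1e" + body + b"\x1d"


raw = build_record([("001", b"b1000001"), ("245", b"10\x1faA title"),
                    ("521", b"  \x1faAges 8-12")])
good = {"fields": [{"001": "b1000001"}, {"245": {}}, {"521": {}}]}
lossy = {"fields": good["fields"][:2]}       # the 521 was lost on the way in
print(check_batch([("b1000001", raw, good), ("b1000002", raw, lossy),
                   ("b1000003", b"garbage", good)]))
```

Comparing tag multisets catches a dropped or duplicated field; it does not by itself prove that field contents survived, which is why the walkthrough distinguishes between its facets.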

What grade is this book? published · 2026-06-17

reading levels, honestly — what the catalog actually says, and where it's silent

Related: Find books at the right level

We measured our own catalog and found reading-level data sparse and noisy — most of the 521 "audience" field is movie ratings, not levels. RL1 is the honest extractor that answers anyway: it parses the real 526/521 fields, stores every value verbatim, throws out the movie ratings, and never guesses a number it can't read. Estimates and a unified browsable grade band come later, clearly flagged.

reading levels · Lexile · ATOS · Guided Reading · Fountas and Pinnell · MARC field · interest age · grade band · AV rating filter · demand-scoped · idempotency · SCD Type 0 · verbatim · honest data · chimpy-reader

chimpy-lake, by the book published · 2026-06-08

a library data platform, pattern by pattern through the data-engineering canon

A capstone tour of the platform itself — every moving part mapped to an established data-engineering pattern and shown running over a small, no-PII fixture.

ELT · schema-on-read · data contracts · medallion · tenants · dbt · DuckDB · idempotency · SCD Type 0 · data quality · observability · control plane

From a curated list to the books on our shelves published · 2026-06-08

matching curated reading lists against the CHPL catalog — with a human in the loop

Every reading list, whether hand-authored or pulled from an award, becomes a named, attributed, browsable entry showing how much of it we already own. Follow three real Newbery titles through the match: the machine is confident where identifiers align, and hands the uncertain ones to a librarian who confirms the right record. The honest payoff: of 177 award titles we own 169 (every Medal winner, 88.6% of Honor), and the rest is a tidy acquisition list.

reading lists · catalog matching · identifier normalization · ISBN · title-author key · human in the loop · adjudication · acquisition list · honest gaps · Datasette

Index of keywords

abstain

Withholding an estimate when the model's runs disagree too much — 'don't know' is an honest answer. → Find books at the right level

accessibility

Declare to publish (Simple English)

acquisition list

The titles we've confirmed the library does NOT own — a ready-made shopping list. → From a curated list to the books on our shelves

adjudication

A librarian's explicit accept of one match candidate the machine couldn't confidently resolve. → From a curated list to the books on our shelves

AI estimate

A grade band a local model infers when no catalogued measure exists — always marked visually distinct. → Find books at the right level

allowlist

Declare to publish (Simple English), Declare to publish

append-only

The consistency report

ATOS

A grade-equivalent readability formula (e.g. AR 4.6) used by Accelerated Reader. → Find books at the right level, What grade is this book?

Further reading: Wikipedia: Accelerated Reader ↗

AV rating filter

The step that discards AV/MPAA ratings from MARC 521 before any reading-level data is stored. → What grade is this book?

bitemporal

Each row carries two clocks — when the fact was true and when we recorded it — so any past state rebuilds exactly. → What did the catalog say last Tuesday?

Further reading: Wikipedia: Temporal database ↗

bronze

The raw, append-only landing layer of the medallion — data as sourced, never edited in place. → What did the catalog say last Tuesday?

browse by level

Filtering a list to show only titles within a chosen grade range. → Find books at the right level

byte-for-byte

Two copies are identical at the binary level — no re-encoding, reordering, or silent loss. → Did we keep your catalog intact?

catalog

The library's full set of bibliographic records describing every title held. → Did we keep your catalog intact?

catalog matching

Linking a list entry to the library's catalog (bib) record(s). → From a curated list to the books on our shelves

chimpy-lake

The library's internal data-lake platform — a medallion store on DuckDB/DuckLake fed by Sierra, OverDrive, and circulation. → What did the catalog say last Tuesday?, Did we keep your catalog intact?

chimpy-reader

The library's public reading-list browser, built on Datasette over the chimpy-lake pipeline. → Find books at the right level, What grade is this book?

churn

The rows that actually changed between two snapshots — what we store instead of a full copy. → What did the catalog say last Tuesday?

code generation

Emitting committed source files from one declaration (the manifest), rather than hand-writing each. → Flat reference dimensions — one manifest, generated dims

code->label dimension

A small lookup table mapping a raw code to a human-readable label (e.g. item_status '!' → ON HOLDSHELF). → Flat reference dimensions — one manifest, generated dims

Further reading: Wikipedia: Lookup table ↗

conformance

A tenant passes the platform's contract test suite (manifest + SDK + Quadlet + smoke). → Scaffolding a tenant: from cookiecutter to conformance

conformed dimension

A shared, standardized lookup that every mart joins to the same way, so labels mean one thing lake-wide. → Flat reference dimensions — one manifest, generated dims

Further reading: Wikipedia: Dimension (data warehouse) ↗

control plane

One place to see and operate every pipeline's status across the fleet. → One lake, many tenants: the control plane, chimpy-lake, by the book

cookiecutter

A tool that generates a project from a templated directory and a set of answers. → Scaffolding a tenant: from cookiecutter to conformance
