2026-06-18T15:39:36Z by Showboat 0.6.1
A data lake is only trustworthy if its copy of a record is the record.
When the catalog’s MARC records flow into the lake, a fair question from any cataloger is: did you keep them intact? Not “approximately” — exactly. A parser bug, a bad migration, a silent encoding slip could drop a field or mangle a record, and nobody would notice until a title went missing from a search months later.
So the platform carries a standing parse-fidelity
invariant: every ingest run samples its records and proves the stored
copy still matches the raw bytes the catalog sent. This walkthrough runs
that real check —
chimpy_lake.quality.check_parse_fidelity and its byte-level
helpers — over a small fixture of valid ISO-2709 records. (pymarc builds
the fixture bytes, exactly as the unit tests do; the validator itself
never imports a MARC parser.) Every command reproduces —
showboat verify re-runs each one and diffs the output.
The check has four facets, but one is load-bearing:
F2 reads the field tags straight from the raw
ISO-2709 bytes — walking the record’s directory by hand, no MARC
parser — and compares them to the tags in the stored
marc_json. It is the only facet that checks the stored JSON
against the raw source bytes, so it’s the one that
actually proves nothing was lost in translation.
uv run python docs/demos/parse-fidelity/_driver.py fidelity
The byte-for-byte proof — F2
============================
A real catalog record — Holes, by Louis Sachar — as the ISO-2709 bytes Sierra returned:
leader: '00000nam a2200000 a 4500'
raw length: 231 bytes
base address: 97 (leader[12:17] — where the record's directory ends)
F2 reads the field tags STRAIGHT FROM THE RAW BYTES — walking the ISO-2709 directory,
no MARC parser — and compares them to the tags in the stored marc_json. It is the ONLY
facet that checks the stored JSON against the raw *source* bytes, so it is the one that
actually proves fidelity:
tags from the raw ISO-2709 directory : ['001', '020', '100', '245', '250', '264']
tags in the stored marc_json : ['001', '020', '100', '245', '250', '264']
check_parse_fidelity -> pass (fail_count=0)
The two tag multisets are equal: the lake's JSON carries exactly the fields the raw
record declared — nothing dropped, added, or quietly reordered away.
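The directory walk F2 relies on can be sketched in a few lines. This is a minimal illustration, not the validator's actual code: `tags_from_raw` and `build_record` are hypothetical names, and `build_record` stands in for the pymarc-built fixture so the sketch needs no MARC library. It assumes the standard ISO-2709 layout: a 24-byte leader, 12-byte directory entries (3-byte tag, 4-byte length, 5-byte offset), and a field terminator closing the directory.

```python
LEADER_LEN = 24
ENTRY_LEN = 12           # 3-byte tag + 4-byte field length + 5-byte start offset
FIELD_TERMINATOR = 0x1E
RECORD_TERMINATOR = b"\x1d"


def tags_from_raw(raw: bytes) -> list[str]:
    """Return the tags declared in an ISO-2709 directory, in directory order."""
    if len(raw) < LEADER_LEN:
        raise ValueError("record is shorter than a 24-byte leader")
    # leader/12-16 holds the base address: the offset where field data begins.
    base_address = int(raw[12:17].decode("ascii"))
    if not LEADER_LEN < base_address <= len(raw):
        raise ValueError("base address is out of range")
    if raw[base_address - 1] != FIELD_TERMINATOR:
        raise ValueError("directory is not closed by a field terminator")
    directory = raw[LEADER_LEN:base_address - 1]
    if len(directory) % ENTRY_LEN:
        raise ValueError("directory length is not a multiple of 12")
    return [
        directory[i:i + 3].decode("ascii")
        for i in range(0, len(directory), ENTRY_LEN)
    ]


def build_record(fields: list[tuple[str, bytes]]) -> bytes:
    """Hypothetical fixture helper: assemble a minimal ISO-2709 record by hand."""
    directory, data = b"", b""
    for tag, body in fields:
        body += b"\x1e"  # every field ends with a field terminator
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base = LEADER_LEN + len(directory) + 1
    total = base + len(data) + 1
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + b"\x1e" + data + RECORD_TERMINATOR


raw = build_record([(t, b"x") for t in ("001", "020", "100", "245", "250", "264")])
print(int(raw[12:17]))     # 97
print(tags_from_raw(raw))  # ['001', '020', '100', '245', '250', '264']
```

Six 12-byte entries plus the closing terminator, after a 24-byte leader, give 24 + 72 + 1 = 97, which agrees with the base address in the output above.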
A check that only ever says “pass” proves nothing. So: take the same record, but pretend the stored JSON silently lost its 245 (title) field. F2 compares the two tag multisets, catches the discrepancy, and names the exact bib that broke — it surfaces the bad record rather than hiding it behind a green light.
uv run python docs/demos/parse-fidelity/_driver.py teeth
It has teeth — a dropped field is caught
========================================
Now suppose the stored marc_json had silently lost the 245 (title) field — the kind of
corruption a parser bug or a bad migration could introduce. The raw bytes are unchanged:
tags from the raw ISO-2709 directory : ['001', '020', '100', '245', '250', '264']
tags in the (corrupted) marc_json : ['001', '020', '100', '250', '264'] <- 245 is gone
check_parse_fidelity -> fail
facet_fails: {'f2': 1, 'f3': 0, 'f4': 0, 'err': 0}
failing_bib_ids: [40115282]
F2 compares multisets, so the missing 245 is caught and the exact bib is named — the
invariant doesn't just assert success, it surfaces the record that broke it.
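A multiset comparison of this kind can be sketched with `collections.Counter`. The function names here are hypothetical, and the shape of the stored `marc_json` is an assumption: the sketch takes the common MARC-in-JSON layout, where `fields` is a list of single-key dicts keyed by tag. The real validator may read its JSON differently.

```python
from collections import Counter


def tags_from_marc_json(marc_json: dict) -> list[str]:
    """Collect tags from MARC-in-JSON (assumed layout: list of single-key dicts)."""
    return [tag for field in marc_json["fields"] for tag in field]


def tag_diff(raw_tags: list[str], stored_tags: list[str]) -> dict:
    """Compare two tag multisets; empty lists on both sides mean they are equal."""
    raw, stored = Counter(raw_tags), Counter(stored_tags)
    return {
        "missing": sorted((raw - stored).elements()),  # declared in raw, absent in JSON
        "extra": sorted((stored - raw).elements()),    # present in JSON, not in raw
    }


raw_tags = ["001", "020", "100", "245", "250", "264"]
corrupted = {
    "leader": "00000nam a2200000 a 4500",
    "fields": [{t: "..."} for t in raw_tags if t != "245"],
}
print(tag_diff(raw_tags, tags_from_marc_json(corrupted)))
# {'missing': ['245'], 'extra': []}
```

Comparing multisets rather than sets matters for repeatable fields: if a record declares two 650s and the JSON keeps only one, a set comparison would still see "650" on both sides, while the counts differ.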
The other three facets matter, but they are not all the same kind of evidence, and the code says so. F1 counts records in versus records out. F3 round-trips the stored JSON against itself: cheap defense-in-depth, and its own docstring warns “do not over-trust a green F3.” F4 checks that leader/09 declares UTF-8; a blank there means the source is still MARC-8, which is a correct partial to flag, not a false alarm. A system that grades its own evidence honestly is one you can actually trust.
uv run python docs/demos/parse-fidelity/_driver.py honest
Honest about which check proves what — F1, F3, F4
=================================================
F1 — record count: every raw ISO-2709 record must come out the other side.
raw=10, parsed=9 -> fail (f1={'raw': 10, 'parsed': 9, 'ok': False})
F3 — round-trip: cheap defense-in-depth, and the code is candid about it in its own
docstring — "it round-trips marc_json against itself ... do not over-trust a green F3."
round_trip_ok=False -> fail (f3=1)
F4 — leader/encoding: leader/09 must be 'a' (UTF-8). A blank means the source is still
MARC-8 — with force_utf8 upstream that's a CORRECT partial to notice, not a false alarm.
leader/09 blanked -> fail (f4=1)
Four facets, graded honestly: F2 is load-bearing (it alone checks the raw bytes); F1/F3/F4
are supporting checks — and the system says which is which, rather than dressing them all
up as the same kind of proof.
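The two byte-level supporting checks are simple enough to sketch. What follows is an illustration under stated assumptions, not the platform's code: the function names are hypothetical, F1 is assumed to count ISO-2709 record terminators (0x1D) in the raw stream, and the returned dict only mirrors the `f1=` shape printed above.

```python
RECORD_TERMINATOR = b"\x1d"


def f1_record_count(stream: bytes, parsed_count: int) -> dict:
    """F1 sketch: every raw record (one 0x1D terminator each) must be parsed."""
    raw_count = stream.count(RECORD_TERMINATOR)
    return {"raw": raw_count, "parsed": parsed_count, "ok": raw_count == parsed_count}


def f4_leader_is_utf8(raw: bytes) -> bool:
    """F4 sketch: leader/09 must be b'a' (UTF-8); a blank means MARC-8."""
    return raw[9:10] == b"a"


# A minimal record: 24-byte leader, empty directory, record terminator.
record = b"00026nam a2200025 a 4500\x1e\x1d"
stream = record * 10

print(f1_record_count(stream, parsed_count=9))
# {'raw': 10, 'parsed': 9, 'ok': False}

blanked = record[:9] + b" " + record[10:]  # wipe leader/09
print(f4_leader_is_utf8(record), f4_leader_is_utf8(blanked))
# True False
```

Neither function looks at the stored JSON, which is why they count as supporting checks here: they can show that a record went missing or that its encoding flag is off, but not that the stored fields match the source.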
Across two million bibs, a single corrupt record is inevitable. The check treats a malformed record as a fidelity failure it counts and names — never an exception that aborts the run. One poison record can’t take the whole check down, and it can’t hide behind a run-wide skip either.
uv run python docs/demos/parse-fidelity/_driver.py never_fatal
Never-fatal — one poison record can't crash or hide in the run
==============================================================
Three sampled records — the middle one's raw bytes are garbage (b'\x00garbage'), the kind
of single corrupt record that must never take down a two-million-bib run:
status: fail
sample_size: 3 pass_count: 2 fail_count: 1
facet_fails: {'f2': 0, 'f3': 0, 'f4': 0, 'err': 1} (err = unparseable, isolated per-record)
failing_bib_ids: [2]
The malformed record is COUNTED as a fidelity failure and its bib named — but it's caught
per-record, so the check finishes and the other two still pass. A raised exception would
have aborted the whole check and hidden the one bad record behind a run-wide skip; this
invariant refuses to let that happen.
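The per-record isolation described above comes down to where the `try` sits: inside the loop, around one record, rather than around the whole run. The sketch below is a hedged illustration; `run_check` and `base_address_is_sane` are hypothetical names, and the result keys only echo the fields printed in the output above.

```python
from typing import Callable, Iterable


def base_address_is_sane(raw: bytes) -> bool:
    """Toy per-record check; raises ValueError when the leader is garbage."""
    base = int(raw[12:17].decode("ascii"))
    return 24 < base <= len(raw)


def run_check(
    samples: Iterable[tuple[int, bytes]],
    check_one: Callable[[bytes], bool],
) -> dict:
    """Run check_one per record; an exception counts as a failure, never aborts."""
    pass_count, err_count, failing = 0, 0, []
    for bib_id, raw in samples:
        try:
            ok = check_one(raw)
        except Exception:
            # Unparseable record: count it and name it, then keep going.
            ok = False
            err_count += 1
        if ok:
            pass_count += 1
        else:
            failing.append(bib_id)
    return {
        "status": "fail" if failing else "pass",
        "sample_size": pass_count + len(failing),
        "pass_count": pass_count,
        "fail_count": len(failing),
        "err": err_count,
        "failing_bib_ids": failing,
    }


good = b"00026nam a2200025 a 4500\x1e\x1d"
result = run_check([(1, good), (2, b"\x00garbage"), (3, good)], base_address_is_sane)
print(result["status"], result["pass_count"], result["failing_bib_ids"])
# fail 2 [2]
```

Catching broadly inside the loop is deliberate in this sketch: the goal is that no single record, however malformed, decides the fate of the others.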
Every command above re-runs under showboat verify. The
on-thesis invariants — F2 proves a clean record and catches a dropped
field (naming the bib), and a malformed record is counted rather than
raised — are pinned by the demo’s own guard test, run against the real
validator:
uv run pytest tests/demos/test_parse_fidelity_demo.py -q 2>&1 | sed -E 's/ in [0-9.]+s//'
.... [100%]
4 passed
Parse-fidelity ships today as a standing per-run check. The next moves are ideas, not commitments, posed as questions:
The discipline is the point. We ship the part we can stand behind completely — every sampled record proven against its own raw bytes, the bad ones named, nothing guessed — and it all re-runs on demand.
Rendered from 226199c on 2026-06-18 · showboat verify: reproduces. A living artifact: the version ledger is git.