Did we keep your catalog intact?

2026-06-18T15:39:36Z by Showboat 0.6.1

Did we keep your catalog intact?

A data lake is only trustworthy if its copy of a record is the record.

When the catalog’s MARC records flow into the lake, a fair question from any cataloger is: did you keep them intact? Not “approximately” — exactly. A parser bug, a bad migration, a silent encoding slip could drop a field or mangle a record, and nobody would notice until a title went missing from a search months later.

So the platform carries a standing parse-fidelity invariant: every ingest run samples its records and proves the stored copy still matches the raw bytes the catalog sent. This walkthrough runs that real check — chimpy_lake.quality.check_parse_fidelity and its byte-level helpers — over a small fixture of valid ISO-2709 records. (pymarc builds the fixture bytes, exactly as the unit tests do; the validator itself never imports a MARC parser.) Every command reproduces — showboat verify re-runs each one and diffs the output.

1 · The byte-for-byte proof — F2

The check has four facets, but one is load-bearing: F2 reads the field tags straight from the raw ISO-2709 bytes — walking the record’s directory by hand, no MARC parser — and compares them to the tags in the stored marc_json. It is the only facet that checks the stored JSON against the raw source bytes, so it’s the one that actually proves nothing was lost in translation.

uv run python docs/demos/parse-fidelity/_driver.py fidelity


The byte-for-byte proof — F2
============================
A real catalog record — Holes, by Louis Sachar — as the ISO-2709 bytes Sierra returned:
  leader:       '00000nam a2200000 a 4500'
  raw length:   231 bytes
  base address: 97  (leader[12:17] — where the record's directory ends)

F2 reads the field tags STRAIGHT FROM THE RAW BYTES — walking the ISO-2709 directory,
no MARC parser — and compares them to the tags in the stored marc_json. It is the ONLY
facet that checks the stored JSON against the raw *source* bytes, so it is the one that
actually proves fidelity:

  tags from the raw ISO-2709 directory : ['001', '020', '100', '245', '250', '264']
  tags in the stored marc_json         : ['001', '020', '100', '245', '250', '264']

  check_parse_fidelity -> pass  (fail_count=0)
  The two tag multisets are equal: the lake's JSON carries exactly the fields the raw
  record declared — nothing dropped, added, or quietly reordered away.

2 · It has teeth — a dropped field is caught

A check that only ever says “pass” proves nothing. So: take the same record, but pretend the stored JSON silently lost its 245 (title) field. F2 compares the two tag multisets, catches the discrepancy, and names the exact bib that broke — it surfaces the bad record rather than hiding it behind a green light.

uv run python docs/demos/parse-fidelity/_driver.py teeth


It has teeth — a dropped field is caught
========================================
Now suppose the stored marc_json had silently lost the 245 (title) field — the kind of
corruption a parser bug or a bad migration could introduce. The raw bytes are unchanged:

  tags from the raw ISO-2709 directory : ['001', '020', '100', '245', '250', '264']
  tags in the (corrupted) marc_json    : ['001', '020', '100', '250', '264']   <- 245 is gone

  check_parse_fidelity -> fail
  facet_fails:      {'f2': 1, 'f3': 0, 'f4': 0, 'err': 0}
  failing_bib_ids:  [40115282]
  F2 compares multisets, so the missing 245 is caught and the exact bib is named — the
  invariant doesn't just assert success, it surfaces the record that broke it.

3 · Honest about which check proves what — F1, F3, F4

The other three facets matter, but they’re not all the same kind of evidence, and the code says so. F1 counts records in vs. out. F3 round-trips the JSON against itself — cheap defense-in-depth, and its own docstring warns “do not over-trust a green F3.” F4 checks the leader declares UTF-8; a MARC-8 leader is a correct partial to flag, not a false alarm. A system that grades its own evidence honestly is one you can actually trust.

uv run python docs/demos/parse-fidelity/_driver.py honest


Honest about which check proves what — F1, F3, F4
=================================================
F1 — record count: every raw ISO-2709 record must come out the other side.
   raw=10, parsed=9  ->  fail   (f1={'raw': 10, 'parsed': 9, 'ok': False})

F3 — round-trip: cheap defense-in-depth, and the code is candid about it in its own
   docstring — "it round-trips marc_json against itself ... do not over-trust a green F3."
   round_trip_ok=False  ->  fail   (f3=1)

F4 — leader/encoding: leader/09 must be 'a' (UTF-8). A blank means the source is still
   MARC-8 — with force_utf8 upstream that's a CORRECT partial to notice, not a false alarm.
   leader/09 blanked  ->  fail   (f4=1)

Four facets, graded honestly: F2 is load-bearing (it alone checks the raw bytes); F1/F3/F4
are supporting checks — and the system says which is which, rather than dressing them all
up as the same kind of proof.

4 · Never-fatal — one bad record can’t crash or hide

Across two million bibs, a single corrupt record is inevitable. The check treats a malformed record as a fidelity failure it counts and names — never an exception that aborts the run. One poison record can’t take the whole check down, and it can’t hide behind a run-wide skip either.

uv run python docs/demos/parse-fidelity/_driver.py never_fatal


Never-fatal — one poison record can't crash or hide in the run
==============================================================
Three sampled records — the middle one's raw bytes are garbage (b'\x00garbage'), the kind
of single corrupt record that must never take down a two-million-bib run:

  status:          fail
  sample_size:     3    pass_count: 2    fail_count: 1
  facet_fails:     {'f2': 0, 'f3': 0, 'f4': 0, 'err': 1}   (err = unparseable, isolated per-record)
  failing_bib_ids: [2]

The malformed record is COUNTED as a fidelity failure and its bib named — but it's caught
per-record, so the check finishes and the other two still pass. A raised exception would
have aborted the whole check and hidden the one bad record behind a run-wide skip; this
invariant refuses to let that happen.

Proof

Every command above re-runs under showboat verify. The on-thesis invariants — F2 proves a clean record and catches a dropped field (naming the bib), and a malformed record is counted rather than raised — are pinned by the demo’s own guard test, run against the real validator:

uv run pytest tests/demos/test_parse_fidelity_demo.py -q 2>&1 | sed -E 's/ in [0-9.]+s//'

....                                                                     [100%]
4 passed

Where this goes next — open questions, not commitments

Parse-fidelity ships today as a standing per-run check. The next moves are ideas, not commitments, posed as questions:

Fidelity at full scale, continuously. Today the check samples each run. Should it sweep the whole 2-million-bib table on a cadence, so a slow-burn corruption can’t hide between samples — and what’s the right cost/coverage trade-off?
Round-trip against the source, not just the bytes. F2 proves the lake’s JSON matches the bytes we received. A stronger claim would re-pull a sampled record live from Sierra and prove the lake still matches the system of record — catching drift the moment it appears.
Fidelity as a published contract. If we can prove intactness on demand, that proof could become a number we publish — a standing “your catalog, intact” guarantee a librarian could check, not just an internal log line.

The discipline is the point. We ship the part we can stand behind completely — every sampled record proven against its own raw bytes, the bad ones named, nothing guessed — and it all re-runs on demand.

← all walkthroughs · Rendered from 226199c on 2026-06-18 · showboat verify: reproduces. A living artifact — the version ledger is git.