Building a Data Lake for Your Library

Early Experiments at CHPL

Ray Voelker
Cincinnati & Hamilton County Public Library
ray.voelker@chpl.org

Previously ...

Datasette + SQLite = great for browsing a single dataset
Collection analysis, Newsdex — each a standalone tool
But what if you want to ask questions across all your data?

The Real Problem: Data Never Sits Still

Your ILS isn't a static database — it's constantly churning
Items are added, weeded, checked out, status codes change — every day
Digital platforms (OverDrive) have their own moving data
To analyze any of this properly, you need structured snapshots at regular intervals

What's Missing Today

Pulling a report today? Great. Comparing it to last month? Hard.
No consistent snapshots — just ad-hoc exports when someone asks
No metadata about when or how the data was captured
Changes vanish — once a record is updated, the old version is gone

What's a Data Lake?

A central repository for structured data from many sources
Snapshot at regular intervals with proper metadata
Open formats (Parquet, not proprietary exports)
Not a data warehouse — lighter, cheaper, more flexible
Built-in history — every snapshot is preserved

Data Sovereignty

Your data belongs to you — not a vendor
Library vendors control how you access your own data
Proprietary formats, limited exports, API restrictions
A data lake puts you in control — open formats, your infrastructure
If you switch vendors, your data lake stays

Sierra ILS

OverDrive API

Future sources

➔

MARC Records

Checkouts

…

➔

Data Lake

Parquet + DuckDB
DuckLake snapshots

➔

SQL Queries

Dashboards

CSV Exports

The Stack

Python — extraction scripts, MARC parsing (pymarc)
Parquet — open columnar format, compressed with zstd
DuckDB — fast analytical queries, no server needed
DuckLake — versioned storage with snapshots

Harvesting the ILS: sierra-marc-harvest

Open source Python CLI for Sierra ILS
Extracts MARC records via the Sierra REST API
Incremental updates — only fetches what changed
Declarative YAML profiles — institution-agnostic
Stores everything in compressed Parquet files

github.com/chimpy-me/ils-lake

Bibs (19 fields)

Title, author, publisher
Call number, ISBN, OCLC#
Subject headings, genres
Material type, bib level
Pub year, cataloging date
Complete MARC as JSON

Items (17 fields)

Barcode, location, item type
Status, call number
Checkout & renewal totals
Last checkout / checkin dates
Price, copy number, volume

The Extraction Profile

Declarative YAML — no code changes needed


tables:
  bibs:
    fields:
      title:
        source: 245$a$b
      author:
        source: [100$a, 110$a, 111$a]
        transform: first_available
      pub_year:
        source: 008[7:11]
        type: int16
      language:
        source: 008[35:38]

Profile Deep Dive: Selectors

Backup slide — for Q&A

MARC addressing grammar — point at the data you want:

Syntax	Meaning	Example
`245$a$b`	Subfields, concatenated	Title + subtitle
`008[7:11]`	Control field slice	Publication year
`[100$a, 110$a, 111$a]`	First available source	Author (personal, corporate, or meeting)
`6XX$a`	Wildcard — all matching fields	All subject headings (650, 651, etc.)

Profile Deep Dive: Transforms

Backup slide — for Q&A

Transforms are plain Python functions registered by name. Chainable: transform: [first_available, rstrip]

Value Transforms

sierra_id — parse .b12345678x to numeric ID
sierra_date — MM-DD-YY → ISO 8601
sierra_datetime — with time component
price_cents — $11.16 → 1116
rstrip, strip, normalize_text
s/pattern/replacement/ — inline regex

List Transforms

first_available — pick first non-empty value from multiple sources
collect_piped — gather repeating fields into Floods|Floods (1937)|Floods (1913)

Customization

Add a Python function to transforms.py
Register it in the TRANSFORMS dict
Reference by name in your YAML profile
Validated at profile load time

Why Parquet + DuckDB?

Parquet: columnar, compressed — a fraction of the size of raw exports
DuckDB: analytical SQL engine, runs on a laptop
No heavy infrastructure — DuckDB is embedded, catalog can be SQLite
Same ecosystem as data science and journalism
Query across tables, across sources — just SQL

DuckLake: Version Your Data

DuckLake — an open data lake format from the DuckDB Foundation
Every harvest creates a lightweight snapshot with metadata
Query your collection as it was on any date
Track changes over time — what was added, modified, deleted
Built-in retention policies — expire old snapshots, reclaim storage

ducklake.select

Beyond the ILS: OverDrive Checkouts

Separate ETL pulls digital checkout data from OverDrive API
Currently stores raw API responses — next step is Parquet
The goal: join ILS + OverDrive data in the same lake
One query to compare physical and digital circulation

github.com/cincinnatilibrary/chpl-etl-overdrive-checkouts

Where We Are (Honestly)

Sierra ILS harvest: working, tested — extracting real data into a test data lake
OverDrive ETL: working prototype — storing snapshots in Python pickle data on disk
Connecting the pieces into a unified lake: in progress
This is early — sharing the concept, not a finished product

What's Next

More data sources — holds, circ history, patron demographics
Pre-built queries and dashboards (Datasette or similar)
Scheduled harvests via systemd timers or cron
Documentation + packaging for other Sierra libraries

The Takeaway

Your library data doesn't have to live in silos
Open formats + free tools = a real data lake on a budget
Start small — one pipeline, one data source
The pattern scales: add sources as you go

Resources

Thanks!

Questions? Ideas? Other data sources you'd want in your lake?

This is early-stage work — feedback welcome!

Ray Voelker
ray.voelker@gmail.com | ray.voelker@chpl.org
github.com/rayvoelker