Building a Data Lake for Your Library

Early Experiments at CHPL

Ray Voelker
Cincinnati & Hamilton County Public Library
ray.voelker@chpl.org

Previously ...

  • Datasette + SQLite = great for browsing a single dataset
  • Collection analysis, Newsdex — each a standalone tool
  • But what if you want to ask questions across all your data?

The Real Problem: Data Never Sits Still

  • Your ILS isn't a static database — it's constantly churning
  • Items are added, weeded, checked out, status codes change — every day
  • Digital platforms (OverDrive) have their own moving data
  • To analyze any of this properly, you need structured snapshots at regular intervals

What's Missing Today

  • Pulling a report today? Great. Comparing it to last month? Hard.
  • No consistent snapshots — just ad-hoc exports when someone asks
  • No metadata about when or how the data was captured
  • Changes vanish — once a record is updated, the old version is gone

What's a Data Lake?

  • A central repository for structured data from many sources
  • Snapshot at regular intervals with proper metadata
  • Open formats (Parquet, not proprietary exports)
  • Not a data warehouse — lighter, cheaper, more flexible
  • Built-in history — every snapshot is preserved

Data Sovereignty

  • Your data belongs to you — not a vendor
  • Library vendors control how you access your own data
  • Proprietary formats, limited exports, API restrictions
  • A data lake puts you in control — open formats, your infrastructure
  • If you switch vendors, your data lake stays
Sierra ILS
OverDrive API
Future sources
MARC Records
Checkouts
Data Lake
Parquet + DuckDB
DuckLake snapshots
SQL Queries
Dashboards
CSV Exports

The Stack

  • Python — extraction scripts, MARC parsing (pymarc)
  • Parquet — open columnar format, compressed with zstd
  • DuckDB — fast analytical queries, no server needed
  • DuckLake — versioned storage with snapshots

Harvesting the ILS: sierra-marc-harvest

  • Open source Python CLI for Sierra ILS
  • Extracts MARC records via the Sierra REST API
  • Incremental updates — only fetches what changed
  • Declarative YAML profiles — institution-agnostic
  • Stores everything in compressed Parquet files

github.com/chimpy-me/ils-lake

Bibs (19 fields)

  • Title, author, publisher
  • Call number, ISBN, OCLC#
  • Subject headings, genres
  • Material type, bib level
  • Pub year, cataloging date
  • Complete MARC as JSON

Items (17 fields)

  • Barcode, location, item type
  • Status, call number
  • Checkout & renewal totals
  • Last checkout / checkin dates
  • Price, copy number, volume

The Extraction Profile

Declarative YAML — no code changes needed


tables:
  bibs:
    fields:
      title:
        source: 245$a$b
      author:
        source: [100$a, 110$a, 111$a]
        transform: first_available
      pub_year:
        source: 008[7:11]
        type: int16
      language:
        source: 008[35:38]
					

Profile Deep Dive: Selectors

Backup slide — for Q&A

MARC addressing grammar — point at the data you want:

Syntax Meaning Example
245$a$b Subfields, concatenated Title + subtitle
008[7:11] Control field slice Publication year
[100$a, 110$a, 111$a] First available source Author (personal, corporate, or meeting)
6XX$a Wildcard — all matching fields All subject headings (650, 651, etc.)

Profile Deep Dive: Transforms

Backup slide — for Q&A

Transforms are plain Python functions registered by name. Chainable: transform: [first_available, rstrip]

Value Transforms

  • sierra_id — parse .b12345678x to numeric ID
  • sierra_dateMM-DD-YY → ISO 8601
  • sierra_datetime — with time component
  • price_cents$11.161116
  • rstrip, strip, normalize_text
  • s/pattern/replacement/ — inline regex

List Transforms

  • first_available — pick first non-empty value from multiple sources
  • collect_piped — gather repeating fields into Floods|Floods (1937)|Floods (1913)

Customization

  • Add a Python function to transforms.py
  • Register it in the TRANSFORMS dict
  • Reference by name in your YAML profile
  • Validated at profile load time

Why Parquet + DuckDB?

  • Parquet: columnar, compressed — a fraction of the size of raw exports
  • DuckDB: analytical SQL engine, runs on a laptop
  • No heavy infrastructure — DuckDB is embedded, catalog can be SQLite
  • Same ecosystem as data science and journalism
  • Query across tables, across sources — just SQL

DuckLake: Version Your Data

  • DuckLake — an open data lake format from the DuckDB Foundation
  • Every harvest creates a lightweight snapshot with metadata
  • Query your collection as it was on any date
  • Track changes over time — what was added, modified, deleted
  • Built-in retention policies — expire old snapshots, reclaim storage

ducklake.select

Beyond the ILS: OverDrive Checkouts

  • Separate ETL pulls digital checkout data from OverDrive API
  • Currently stores raw API responses — next step is Parquet
  • The goal: join ILS + OverDrive data in the same lake
  • One query to compare physical and digital circulation

github.com/cincinnatilibrary/chpl-etl-overdrive-checkouts

Where We Are (Honestly)

  • Sierra ILS harvest: working, tested — extracting real data into a test data lake
  • OverDrive ETL: working prototype — storing snapshots in Python pickle data on disk
  • Connecting the pieces into a unified lake: in progress
  • This is early — sharing the concept, not a finished product

What's Next

  • More data sources — holds, circ history, patron demographics
  • Pre-built queries and dashboards (Datasette or similar)
  • Scheduled harvests via systemd timers or cron
  • Documentation + packaging for other Sierra libraries

The Takeaway

  • Your library data doesn't have to live in silos
  • Open formats + free tools = a real data lake on a budget
  • Start small — one pipeline, one data source
  • The pattern scales: add sources as you go

Resources

Thanks!

Questions? Ideas? Other data sources you'd want in your lake?

This is early-stage work — feedback welcome!

Ray Voelker
ray.voelker@gmail.com | ray.voelker@chpl.org
github.com/rayvoelker