I ❤️ Datasette

Collection Analysis & the Newsdex Project at CHPL

Ray Voelker
Cincinnati & Hamilton County Public Library
ray.voelker@chpl.org

Your Data Is Trapped

Libraries sit on TONS of structured data
ILSs weren't designed for flexible, arbitrary queries and analysis tasks
The people who work most closely with the collection often have the least direct access to the data

So What Do People Do?

Wrestle with complicated data pipelines and stale spreadsheets
Rely on one or two staff to produce reports
Scrape the Public WebPAC (this is especially bad)

What If You Could Just... Browse It?

Browse the entire collection metadata in a web browser
Filter by location, format, status, date — instantly
Run pre-built queries without writing SQL
Export results to CSV
Access a JSON API for advanced use

collection-analysis.cincy.pl
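
The filtering, CSV export, and JSON API listed above all come down to URLs on the Datasette site. A minimal sketch of building those URLs follows. The database name `current_collection`, the table name `item`, the column names, and the `https` scheme are assumptions for illustration and may not match the live site.

```python
from urllib.parse import urlencode

# Assumed values: the real database and table names on the site may differ.
BASE_URL = "https://collection-analysis.cincy.pl"
DATABASE = "current_collection"
TABLE = "item"


def table_url(fmt, filters, base_url=BASE_URL, database=DATABASE, table=TABLE):
    """Build a Datasette table URL in the given format ("json" or "csv").

    Each key in `filters` becomes an exact-match column filter in the
    query string, for example {"item_format": "Book"} -> ?item_format=Book
    """
    params = dict(filters)
    if fmt == "json":
        # _shape=array asks Datasette for a plain JSON list of row objects
        params["_shape"] = "array"
    url = f"{base_url}/{database}/{table}.{fmt}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# Hypothetical column names and codes, for illustration only.
wanted = {"item_format": "Book", "item_status_code": "-"}
json_url = table_url("json", wanted)
csv_url = table_url("csv", wanted)
print(json_url)
print(csv_url)
```

The same filters drive both formats, so whatever a staff member sees in the browser can be handed to a script or a spreadsheet unchanged.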

What is Datasette?

Created by Simon Willison (co-creator of Django)
"A tool for exploring and publishing data"
Aimed at journalists, museum curators, archivists, local governments, and anyone else who has data
Read-only by default — safe to hand to anyone

datasette.io

Things I ❤️ about Datasette

Written in Python / Available via PyPI.org
Easy and intuitive to use
Well-documented
Built-in API... THE API IS SQL!
Open source + supportive developer
Flexible deployment with CHEAP hosting options

The Pipeline


Sierra ILS PostgreSQL (Sierra DB)
    ↓
Python Scripts (sqlite-utils)
    ↓
SQLite Database
    ↓
Datasette → collection-analysis.cincy.pl

github.com/cincinnatilibrary/collection-analysis
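
The middle step of the pipeline, turning exported rows into a SQLite file, can be sketched as below. This is not the code from the repository: it swaps in Python's standard-library `sqlite3` module in place of sqlite-utils, and hard-coded sample rows stand in for the rows read from the Sierra PostgreSQL database. The table and column names are hypothetical.

```python
import sqlite3


def load_rows(conn, table, rows):
    """Recreate `table` from the keys of the first row and insert all rows.

    Returns the number of rows written. Dropping and recreating the
    table keeps each run a clean snapshot of the source data.
    """
    rows = list(rows)
    if not rows:
        return 0
    columns = list(rows[0])
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join(f":{c}" for c in columns)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


# Sample rows standing in for the result of a query against Sierra.
sample_rows = [
    {"item_id": 1, "location_code": "ma", "item_format": "Book", "item_status_code": "-"},
    {"item_id": 2, "location_code": "ma", "item_format": "DVD", "item_status_code": "-"},
    {"item_id": 3, "location_code": "an", "item_format": "Book", "item_status_code": "m"},
]

conn = sqlite3.connect(":memory:")  # use a file path to produce a real .db file
loaded = load_rows(conn, "item", sample_rows)
counts = conn.execute(
    "SELECT item_format, count(*) FROM item GROUP BY item_format ORDER BY item_format"
).fetchall()
print(loaded, counts)
```

Writing to a file instead of `:memory:` yields a database that can be served with `datasette` followed by the file name.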

Demo Time!

Hosting: $5 a Month. Seriously.

DigitalOcean Droplet: $5/month

1 vCPU, 1 GB RAM, 25 GB SSD

Apache reverse proxy (free)
Let's Encrypt TLS (free)
Datasette + Python (free, open source)

Once You Have the Pattern, Everything Is a Dataset

ILS collection data ✓
Newspaper indexes (Newsdex) — let me show you...

"News reports of major local happenings would be matters of history some day and that the proper indexing of such items in time would be valuable"

— Chalmers Hadley, Head Librarian, CHPL (1927)

Librarians hand-typed index cards for newspaper articles
Each card: title, newspaper, date, page, column, subject
Coverage: Enquirer, Post, Herald, Cincinnati Magazine, and more
Selected coverage reaching back to the early 1800s

Newsdex Goes Digital — Then Gets Stuck

March 1992: card catalog computerized as "Newsdex" in Millennium ILS
~1.8 million MARC records of newspaper citations
CHPL migrated to Sierra — but Newsdex stayed on the old Millennium server
Maintaining a legacy ILS alongside Sierra: costly and fragile
Goal: preserve the data, decommission Millennium

The Data: MARC → Something Useful

MARC Record


245    $a Chart shows rapid rise
       of river: Comparison of
       present flood with
       previous high stages
260    $c 1937
650    $a Floods
650    $a Floods (1937)
650    $a Floods (1913)
650    $a Floods (1884)
773    $a Cincinnati Enquirer
       $g 01/27/1937 6:1 pic

What a human wants


Title:     Chart shows rapid rise
           of river...
Newspaper: Cincinnati Enquirer
Date:      1937-01-27
Page:      6
Column:    1
Image:     Yes
Subjects:  Floods; Floods (1937);
           Floods (1913);
           Floods (1884)
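
Most of the gap between the MARC record and the human-friendly view sits in the 773 $g subfield, which packs date, page, column, and image flag into one string. A minimal parsing sketch follows. It assumes every citation looks like the single example shown (`MM/DD/YYYY page:column`, optionally followed by `pic`); real records may vary, so anything that does not match returns `None` rather than a guess.

```python
import re
from datetime import datetime

# Assumed layout, based only on the example "01/27/1937 6:1 pic":
# MM/DD/YYYY, then page:column, then an optional "pic" flag.
CITATION_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<page>[^\s:]+):(?P<column>\S+)"
    r"(?:\s+(?P<pic>pic))?\s*$"
)


def parse_citation(text):
    """Split a 773 $g citation into date, page, column, and image flag.

    Returns None when the text does not match the assumed layout, so
    odd records can be set aside for review instead of being guessed at.
    """
    match = CITATION_RE.match(text.strip())
    if match is None:
        return None
    try:
        date = datetime.strptime(match["date"], "%m/%d/%Y").date().isoformat()
    except ValueError:  # digits in the right shape but not a real date
        return None
    return {
        "date": date,
        "page": match["page"],      # kept as text in case pages are not purely numeric
        "column": match["column"],
        "image": match["pic"] is not None,
    }


parsed = parse_citation("01/27/1937 6:1 pic")
print(parsed)
```

Storing the date in ISO form (`1937-01-27`) means plain text sorting and range filters in SQLite behave chronologically.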

The Newsdex Pipeline


Millennium ILS → MARC Export (.mrc)
    ↓
pymarc + sqlite-utils (Python)
    ↓
SQLite Database
    ↓
Datasette → newsdex.chpl.org

github.com/cincinnatilibrary/newsdex
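
The pymarc step can be sketched as flattening each record into one row. This is an illustration under stated assumptions, not the repository's code: the field choices follow the example record shown earlier, the row layout is hypothetical, and `StubField` and `StubRecord` are stand-ins that mimic only the two pymarc methods used (`get_fields` and `get_subfields`) so the example runs without a `.mrc` file.

```python
def first_subfield(record, tag, code):
    """Return the first `code` subfield of the first matching `tag` field, or None."""
    for field in record.get_fields(tag):
        values = field.get_subfields(code)
        if values:
            return values[0]
    return None


def flatten(record):
    """Turn one MARC record into a flat dict ready to insert as a row."""
    subjects = [
        value
        for field in record.get_fields("650")
        for value in field.get_subfields("a")
    ]
    return {
        "title": first_subfield(record, "245", "a"),
        "year": first_subfield(record, "260", "c"),
        "newspaper": first_subfield(record, "773", "a"),
        "citation": first_subfield(record, "773", "g"),
        "subjects": "; ".join(subjects),
    }


def read_rows(path):
    """Yield flattened rows from a .mrc file (requires pymarc)."""
    from pymarc import MARCReader  # imported here so flatten() works without pymarc

    with open(path, "rb") as handle:
        for record in MARCReader(handle):
            if record is not None:  # pymarc yields None for unreadable records
                yield flatten(record)


class StubField:
    """Stand-in for a pymarc field: a tag plus (code, value) subfield pairs."""

    def __init__(self, tag, subfields):
        self.tag = tag
        self._subfields = subfields

    def get_subfields(self, *codes):
        return [value for code, value in self._subfields if code in codes]


class StubRecord:
    """Stand-in for a pymarc record holding a list of StubField objects."""

    def __init__(self, fields):
        self._fields = fields

    def get_fields(self, *tags):
        return [field for field in self._fields if field.tag in tags]


example = StubRecord([
    StubField("245", [("a", "Chart shows rapid rise of river: Comparison of present flood with previous high stages")]),
    StubField("260", [("c", "1937")]),
    StubField("650", [("a", "Floods")]),
    StubField("650", [("a", "Floods (1937)")]),
    StubField("650", [("a", "Floods (1913)")]),
    StubField("650", [("a", "Floods (1884)")]),
    StubField("773", [("a", "Cincinnati Enquirer"), ("g", "01/27/1937 6:1 pic")]),
])
row = flatten(example)
print(row)
```

With pymarc installed, `read_rows` applied to a hypothetical export file would yield the same kind of dict for every record, ready to load into SQLite.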

A Century of Cincinnati News, Explorable

Resources

Thanks!

Thank you to the Genealogy & Local History Department — they created the Newsdex data over nearly 100 years.

❤️ Datasette? Let's talk!

Ray Voelker
ray.voelker@gmail.com | ray.voelker@chpl.org
github.com/rayvoelker