I ❤️ Datasette

Collection Analysis & the Newsdex Project at CHPL

Ray Voelker
Cincinnati & Hamilton County Public Library
ray.voelker@chpl.org

Your Data Is Trapped

  • Libraries sit on TONS of structured data
  • Integrated library systems (ILSs) weren't designed for flexible, ad hoc queries and analysis
  • The people who know the collection best often have the least direct access to the data

So What Do People Do?

  • Wrestle with complicated data pipelines or stale spreadsheets
  • Rely on one or two staff members to produce reports
  • Scrape the public WebPAC (this is especially bad)

What If You Could Just... Browse It?

  • Browse the entire collection metadata in a web browser
  • Filter by location, format, status, date — instantly
  • Run pre-built queries without writing SQL
  • Export results to CSV
  • Access a JSON API for advanced use
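
As a rough sketch of the last two bullets: Datasette serves each table at its own URL, and appending `.csv` or `.json` to that URL switches the output format, while column filters travel as query-string arguments. The database name `current_collection`, the table name `item`, and the column `location_code` are hypothetical placeholders, not the real names on the live site.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://collection-analysis.cincy.pl"


def table_url(database, table, fmt="json", **filters):
    """Build a Datasette table URL; keyword filters become ?column=value."""
    path = f"{BASE_URL}/{quote(database)}/{quote(table)}.{fmt}"
    query = urlencode(filters)
    return f"{path}?{query}" if query else path


# Hypothetical names: substitute a real database, table, and column.
print(table_url("current_collection", "item", fmt="csv", location_code="ma"))
# → https://collection-analysis.cincy.pl/current_collection/item.csv?location_code=ma
```

Pasting a URL like this into a spreadsheet import dialog or a script is usually all the "integration" that is needed.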

collection-analysis.cincy.pl

What is Datasette?

  • Created by Simon Willison (co-creator of Django)
  • "A tool for exploring and publishing data"
  • Aimed at journalists, museum curators, archivists, local governments, and anyone else who has data
  • Read-only by default — safe to hand to anyone

datasette.io

Things I ❤️ about Datasette

  • Written in Python and available on PyPI
  • Easy and intuitive to use
  • Well-documented
  • Built-in API... THE API IS SQL!
  • Open source + supportive developer
  • Flexible deployment with CHEAP hosting options
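
To make "the API is SQL" concrete, here is a minimal sketch of calling Datasette's JSON endpoint with a read-only query and a named parameter. The database name, table name, and column name are assumptions for illustration only, and `run_query` needs network access, so it is defined but not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def sql_api_url(base_url, database, sql, **params):
    """Return the JSON API URL for a SQL query; extra kwargs fill :named parameters."""
    query = {"sql": sql, "_shape": "array", **params}
    return f"{base_url}/{database}.json?{urlencode(query)}"


def run_query(url):
    """Fetch the URL and decode the JSON array of row objects (requires network)."""
    with urlopen(url) as response:
        return json.load(response)


url = sql_api_url(
    "https://collection-analysis.cincy.pl",
    "current_collection",  # hypothetical database name
    "select count(*) as n from item where item_format = :fmt",  # hypothetical table and column
    fmt="Book",
)
print(url)
```

Using a named parameter (`:fmt`) rather than pasting the value into the SQL string keeps the query reusable and avoids quoting mistakes.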

The Pipeline


Sierra ILS PostgreSQL (Sierra DB)
    ↓
Python Scripts (sqlite-utils)
    ↓
SQLite Database
    ↓
Datasette → collection-analysis.cincy.pl
					

github.com/cincinnatilibrary/collection-analysis
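
The middle step of the diagram can be sketched roughly as follows. This is not the code from the repository: it uses the standard-library `sqlite3` module in place of sqlite-utils so it runs without extra packages, and the hard-coded `sample_rows` stand in for the result of a Sierra PostgreSQL query. The table and column names are hypothetical.

```python
import sqlite3


def load_rows(db_path, table, rows, pk):
    """Create `table` if needed, upsert a list of same-keyed dicts, return the row count."""
    if not rows:
        return 0
    columns = list(rows[0])
    col_defs = ", ".join(
        f'"{c}" PRIMARY KEY' if c == pk else f'"{c}"' for c in columns
    )
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join(f":{c}" for c in columns)
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success, rolls back on error
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
        conn.executemany(
            f'INSERT OR REPLACE INTO "{table}" ({col_list}) VALUES ({placeholders})',
            rows,
        )
    count = conn.execute(f'SELECT count(*) FROM "{table}"').fetchone()[0]
    conn.close()
    return count


# Stand-in for rows fetched from the ILS database (hypothetical columns).
sample_rows = [
    {"item_id": 1, "location_code": "ma", "item_format": "Book", "status": "available"},
    {"item_id": 2, "location_code": "av", "item_format": "DVD", "status": "checked out"},
]
print(load_rows(":memory:", "item", sample_rows, pk="item_id"))  # → 2
```

With sqlite-utils, the create-and-upsert logic above typically collapses into a single `insert_all` call on a table object. Pointing Datasette at the resulting file (`datasette collection.db`, with a file name of your choosing) is then enough to browse it.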

Demo Time!

Hosting: $5 a Month. Seriously.

  • DigitalOcean Droplet: $5/month
    • 1 vCPU, 1 GB RAM, 25 GB SSD
  • Apache reverse proxy (free)
  • Let's Encrypt TLS (free)
  • Datasette + Python (free, open source)

Once You Have the Pattern, Everything Is a Dataset

  • ILS collection data ✓
  • Newspaper indexes (Newsdex) — let me show you...

[Image: Chalmers Hadley]

"News reports of major local happenings would be matters of history some day and that the proper indexing of such items in time would be valuable"

— Chalmers Hadley, Head Librarian, CHPL (1927)

  • Librarians hand-typed index cards for newspaper articles
  • Each card: title, newspaper, date, page, column, subject
  • Coverage: Enquirer, Post, Herald, Cincinnati Magazine, and more
  • Selected coverage reaching back to the early 1800s

[Image: Newsdex card catalog at CHPL]

Newsdex Goes Digital — Then Gets Stuck

  • March 1992: card catalog computerized as "Newsdex" in Millennium ILS
  • ~1.8 million MARC records of newspaper citations
  • CHPL migrated to Sierra, but Newsdex stayed on the old Millennium server
  • Maintaining a legacy ILS alongside Sierra: costly and fragile
  • Goal: preserve the data, decommission Millennium

The Data: MARC → Something Useful

MARC Record


245    $a Chart shows rapid rise
       of river: Comparison of
       present flood with
       previous high stages
260    $c 1937
650    $a Floods
650    $a Floods (1937)
650    $a Floods (1913)
650    $a Floods (1884)
773    $a Cincinnati Enquirer
       $g 01/27/1937 6:1 pic
							

What a human wants


Title:     Chart shows rapid rise
           of river...
Newspaper: Cincinnati Enquirer
Date:      1937-01-27
Page:      6
Column:    1
Image:     Yes
Subjects:  Floods; Floods (1937);
           Floods (1913);
           Floods (1884)
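
Most of the translation from the left column to the right one comes down to unpacking the 773 `$g` string. The sketch below assumes the layout `MM/DD/YYYY page:column [pic]`, which is inferred from the single example shown here. Real Newsdex records may well contain variations this pattern does not handle, so unparseable values return `None` instead of raising.

```python
import re
from datetime import datetime

# Assumed layout of 773 $g, inferred from one example: MM/DD/YYYY page:column [pic]
CITATION_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<page>[^:\s]+):(?P<column>\S+)(?P<rest>.*)$"
)


def parse_citation(g):
    """Split a 773 $g string into date, page, column, and an image flag."""
    match = CITATION_RE.match(g.strip())
    if match is None:
        return None
    try:
        date = datetime.strptime(match.group("date"), "%m/%d/%Y").date()
    except ValueError:  # e.g. month 13
        return None
    return {
        "date": date.isoformat(),
        "page": match.group("page"),
        "column": match.group("column"),
        "image": "pic" in match.group("rest").split(),
    }


print(parse_citation("01/27/1937 6:1 pic"))
# → {'date': '1937-01-27', 'page': '6', 'column': '1', 'image': True}
```

Page and column are kept as strings because newspaper pages are not always plain integers.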
							

The Newsdex Pipeline


Millennium ILS → MARC Export (.mrc)
    ↓
pymarc + sqlite-utils (Python)
    ↓
SQLite Database
    ↓
Datasette → newsdex.chpl.org
					

github.com/cincinnatilibrary/newsdex
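
The pymarc step in the diagram amounts to walking each record's fields and copying the interesting subfields into one flat row. To stay runnable without pymarc or a `.mrc` file, this sketch represents a record as a plain list of `(tag, subfields)` pairs. That representation and the function name are illustrative assumptions, not the project's actual code.

```python
def flatten_record(fields):
    """Turn [(tag, {subfield_code: value}), ...] into one flat dict."""
    row = {"title": None, "year": None, "newspaper": None, "citation": None, "subjects": []}
    for tag, sub in fields:
        if tag == "245":
            row["title"] = sub.get("a")
        elif tag == "260":
            row["year"] = sub.get("c")
        elif tag == "650" and "a" in sub:
            row["subjects"].append(sub["a"])  # 650 repeats, so collect every one
        elif tag == "773":
            row["newspaper"] = sub.get("a")
            row["citation"] = sub.get("g")
    return row


# The 1937 flood record from the earlier slide, in the simplified form.
flood_record = [
    ("245", {"a": "Chart shows rapid rise of river: Comparison of present flood with previous high stages"}),
    ("260", {"c": "1937"}),
    ("650", {"a": "Floods"}),
    ("650", {"a": "Floods (1937)"}),
    ("650", {"a": "Floods (1913)"}),
    ("650", {"a": "Floods (1884)"}),
    ("773", {"a": "Cincinnati Enquirer", "g": "01/27/1937 6:1 pic"}),
]

row = flatten_record(flood_record)
print(row["newspaper"])            # → Cincinnati Enquirer
print("; ".join(row["subjects"]))  # → Floods; Floods (1937); Floods (1913); Floods (1884)
```

Each flat row can then be written to SQLite, with the repeating subjects either joined into one text column or split into their own table.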

A Century of Cincinnati News, Explorable

Resources

Thanks!

Thank you to the Genealogy & Local History Department — they created the Newsdex data over nearly 100 years.

❤️ Datasette? Let's talk!

Ray Voelker
ray.voelker@gmail.com | ray.voelker@chpl.org
github.com/rayvoelker