dbt, explained with our OverDrive checkouts

how a pile of raw vendor data becomes something you can answer questions with

Ray Voelker
Cincinnati & Hamilton County Public Library
ray.voelker@chpl.org

What dbt is

A data pipeline has three jobs: Extract (pull the data), Load (store it), and Transform (make it usable) — ETL. dbt does the Transform — turning raw, loaded data into something you can answer questions with.
dbt is a build tool for SQL. You write each transformation as one SELECT — called a model — and dbt turns it into a table, in the right order, tested and documented.
It has no database of its own. It's the chef, not the stove.
The vendor hands us a messy pile of JSON; one model turns it into a clean table:

-- model: checkouts  (one SELECT = one table)
select checkoutId as checkout_id,
       title, branch, format,
       try_strptime(checkoutDateUtc,
                    '%m/%d/%Y %H:%M:%S') as checked_out
from read_json('overdrive_*/page_*.json')

ref(): it figures out the order

Enrichment = combining. To get checkouts by audience rating, we join checkouts to a title dimension.
You point one model at another with ref(), and dbt works out what to build first — and draws the map.

from {{ ref('checkouts') }} c
join {{ ref('dim_title') }} d using (title_id)

raw → checkouts → dim_title ↘ daily_by_audience
Arranging data in layers like this — raw → refined → enriched — has a name: medallion. dbt moves data between the layers. (More in its own issue.)

Incremental: only touch what's new

OverDrive is pulled twice a day on an overlapping 2-day window, so the same checkout lands in ~4 files.
An incremental model keeps 59,770 unique checkouts — and a re-run adds 0 rows.
Meaningful counts require this dedup; otherwise every number is inflated.

Tests: the data checks itself

Before you trust an enriched number, dbt checks it. A test is a rule that runs on every build:

columns:
  - name: checkout_id
    tests: [not_null, unique]

No two checkouts share an ID. If that's ever false, the build stops — a bad number never ships.

Lineage: see the whole pipeline

One command — dbt docs generate — draws the dependency graph and the documentation.
You can trace checkouts by audience rating all the way back to the raw vendor file.
That traceability matters a lot when the data is about customers.

dbt-duckdb: all on one machine

DuckDB is "SQLite for analytics" — a fast database that runs inside the program, no server. dbt-duckdb plugs dbt into it.
The whole enrichment pipeline runs in seconds, on one machine, against our own data — no warehouse, no cloud bill.
The big-company toolchain, library-sized — and the same recipe enriches Hoopla checkouts too.