Reproducible pipelines

A finding nobody can reproduce isn’t evidence — it’s an assertion. Reproducible pipelines mean every number we publish can be traced to its source and regenerated, by us or by a skeptic, with one command.

We design pipelines around a few durable principles, drawn from The Turing Way and Sandve et al.’s Ten Simple Rules for Reproducible Computational Research:

One command, end to end. Acquisition → cleaning → validation → output runs under a workflow tool such as Snakemake or Make, with no hand-run steps in between.
Version everything. Code in Git; data and models in DVC; nothing depends on a file only one person has.
Pin the environment. Docker images or lockfiles (conda-lock, uv) fix exact dependency versions, so the pipeline runs the same next year.
Validate in place. Schema and range checks (Frictionless, Great Expectations) catch bad inputs before they reach a chart.
Literate output. Quarto and Jupyter documents interleave code, results, and prose.
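
As a rough illustration of how the principles above fit together, here is a minimal sketch in plain Python. It is not our production setup, and it stands in for what Snakemake, DVC, or Frictionless would do in a real project. The inline dataset, the column names (station, temp_c), the plausibility range, and every function name are hypothetical, chosen only to keep the example self-contained.

```python
import csv
import hashlib
import io
import json

# Hypothetical raw input. A real pipeline would read a versioned file
# (for example one tracked with DVC) instead of an inline string.
RAW_CSV = "station,temp_c\nA,12.5\nB,14.0\nC,13.1\n"


def acquire() -> str:
    """Acquisition step: return the raw source text unchanged."""
    return RAW_CSV


def clean(raw: str) -> list[dict]:
    """Cleaning step: parse the CSV and convert types."""
    reader = csv.DictReader(io.StringIO(raw))
    return [
        {"station": row["station"].strip(), "temp_c": float(row["temp_c"])}
        for row in reader
    ]


def validate(rows: list[dict]) -> list[dict]:
    """Validation step: fail loudly before bad values reach an output."""
    if not rows:
        raise ValueError("no rows to validate")
    for i, row in enumerate(rows):
        if not row["station"]:
            raise ValueError(f"row {i}: empty station name")
        # Assumed plausibility range for an air temperature in Celsius.
        if not -90.0 <= row["temp_c"] <= 60.0:
            raise ValueError(f"row {i}: temp_c out of range: {row['temp_c']}")
    return rows


def summarize(rows: list[dict]) -> dict:
    """Output step: compute the numbers that would be published."""
    temps = [row["temp_c"] for row in rows]
    return {"n": len(temps), "mean_temp_c": round(sum(temps) / len(temps), 2)}


def fingerprint(text: str) -> str:
    """SHA-256 of the input, so a published number can be tied to its source."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_pipeline() -> dict:
    """Single entry point: every step runs in order, none by hand."""
    raw = acquire()
    rows = validate(clean(raw))
    return {"input_sha256": fingerprint(raw), "result": summarize(rows)}


if __name__ == "__main__":
    print(json.dumps(run_pipeline(), indent=2, sort_keys=True))
```

Running the script twice on the same input yields the same result and the same input hash, which is the property an auditor would check. In practice a workflow tool adds what this sketch leaves out: dependency tracking between steps, so only the stages affected by a change are re-run.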

The payoff is accountability: a reproducible pipeline can be audited, corrected, and re-run when the source data updates — and handed to someone else without a phone call.

See the data liberation toolkit (these principles, packaged), our tooling & code, and how outputs become durable archives. Bring a workflow to the help desk.