A finding nobody can reproduce isn’t evidence — it’s an assertion. Reproducible pipelines mean every number we publish can be traced to its source and regenerated, by us or by a skeptic, with one command.
We design pipelines around a few durable principles, drawn from The Turing Way and Sandve et al.’s Ten Simple Rules for Reproducible Computational Research:
- One command, end to end. Acquisition → cleaning → validation → output is automated with workflow tools like Snakemake or
make, not hand-run steps. - Version everything. Code in Git; data and models in DVC; nothing depends on a file only one person has.
- Pin the environment. Docker or lockfiles (conda/uv) so the pipeline runs the same next year.
- Validate in place. Schema and range checks (Frictionless, Great Expectations) catch bad inputs before they reach a chart.
- Literate output. Quarto and Jupyter documents interleave code, results, and prose.
The payoff is accountability: a reproducible pipeline can be audited, corrected, and re-run when the source data updates — and handed to someone else without a phone call.
See the data liberation toolkit (these principles, packaged), our tooling & code, and how outputs become durable archives. Bring a workflow to the help desk.