Every dataset we archive ships with a documentation guide so the next person can trust it without re-deriving where it came from. A good guide answers four questions: what the data measures, where and when it came from, how it was collected and transformed, and what it cannot tell you.
We model our guides on established standards rather than inventing our own:
- Datasheets for Datasets (Gebru et al.): motivation, composition, collection process, and recommended uses.
- Data Statements (Bender & Friedman): especially for text and language data.
- Frictionless Data Package: a machine-readable descriptor that travels with the files; its Table Schema records each field's type and constraints, and we note units alongside.
- W3C PROV for provenance and Dublin Core for descriptive metadata.
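As a rough illustration of the machine-readable layer, the sketch below builds a minimal Data Package style descriptor in Python and checks that every field carries a type and a description. The dataset, file path, and field names are hypothetical, the `unit` key is an assumed custom property rather than part of the core Table Schema, and `undocumented_fields` is an illustrative helper, not a function from any Frictionless library.

```python
import json

# Hypothetical dataset: every name, path, and value here is illustrative.
descriptor = {
    "name": "county-air-quality",
    "title": "County air quality readings (illustrative)",
    "description": "Daily PM2.5 readings by county.",
    "sources": [{"title": "Example environmental agency"}],
    "resources": [
        {
            "name": "readings",
            "path": "readings.csv",
            "schema": {
                "fields": [
                    {
                        "name": "county_fips",
                        "type": "string",
                        "description": "Five-digit county identifier.",
                        "constraints": {"required": True, "pattern": "^[0-9]{5}$"},
                    },
                    {
                        "name": "date",
                        "type": "date",
                        "description": "Calendar day of the reading.",
                        "constraints": {"required": True},
                    },
                    {
                        "name": "pm25",
                        "type": "number",
                        "description": "Daily mean fine particulate concentration.",
                        "unit": "ug/m3",  # assumed custom property, not core Table Schema
                        "constraints": {"minimum": 0},
                    },
                ]
            },
        }
    ],
}


def undocumented_fields(descriptor):
    """Return (resource, field) pairs that lack a type or a description."""
    missing = []
    for resource in descriptor["resources"]:
        for field in resource["schema"]["fields"]:
            if not field.get("type") or not field.get("description"):
                missing.append((resource["name"], field["name"]))
    return missing


# Serialize the descriptor so it can be saved next to the data files.
descriptor_json = json.dumps(descriptor, indent=2)
print(undocumented_fields(descriptor))
```

Written out as `datapackage.json` beside the CSV, a descriptor like this is what lets the schema travel with the files; the completeness check is one possible gate before archiving.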
Each guide pairs that structured metadata with a plain-language codebook (every field, its units, and its coded values) and a frank limitations section: known gaps, suppressed cells, definition changes across years, and the questions the data is not fit to answer. Where a series was altered or removed, we record what changed and when; that record is the difference between a defensible story and a correction.
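One way to picture the human-readable half is a small renderer that turns field notes, limitations, and a change log into a plain-text codebook. This is a sketch under assumptions: the `FieldDoc` and `Change` classes, the `render_codebook` function, and all sample values are hypothetical and do not describe an existing tool or a real dataset.

```python
from dataclasses import dataclass, field


@dataclass
class FieldDoc:
    """Plain-language notes for one column (hypothetical structure)."""
    name: str
    description: str
    unit: str = ""
    codes: dict = field(default_factory=dict)  # coded value -> meaning


@dataclass
class Change:
    """One entry in the change log: what changed, in which series, and when."""
    date: str
    series: str
    note: str


def render_codebook(fields, limitations, changes):
    """Render the codebook, limitations, and change log as plain text."""
    lines = ["CODEBOOK"]
    for f in fields:
        unit = f" ({f.unit})" if f.unit else ""
        lines.append(f"- {f.name}{unit}: {f.description}")
        for code, label in f.codes.items():
            lines.append(f"    {code} = {label}")
    lines.append("LIMITATIONS")
    lines.extend(f"- {item}" for item in limitations)
    lines.append("CHANGE LOG")
    lines.extend(f"- {c.date} {c.series}: {c.note}" for c in changes)
    return "\n".join(lines)


# Illustrative values only.
fields = [
    FieldDoc("pm25", "Daily mean fine particulate concentration", "ug/m3"),
    FieldDoc(
        "monitor_status",
        "Whether the monitor reported that day",
        codes={"1": "reported", "9": "suppressed (small cell)"},
    ),
]
limitations = ["Counties without a monitor are absent, not zero."]
changes = [Change("2021-03-01", "pm25", "Definition changed from 24-hour max to daily mean.")]

codebook = render_codebook(fields, limitations, changes)
print(codebook)
```

Keeping limitations and the change log in the same rendered document as the field list means a reader meets the caveats at the same moment as the definitions.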
The goal is FAIR data (Findable, Accessible, Interoperable, Reusable), documented in language a non-specialist can follow.
Browse the public data archive, see how documentation fits our reproducible pipelines, or ask the help desk to document a dataset with you.