Reproducible data citations for computational research

Author : Christian Schulz

The general purpose of a scientific publication is the exchange and spread of knowledge. A publication usually reports a scientific result and tries to convince the reader that it is valid.

With an ever-growing number of papers relying on computational methods that make use of large quantities of data and sophisticated statistical modeling techniques, a textual description of the result is often not enough for a publication to be transparent and reproducible.

While there are efforts to encourage sharing of code and data, we currently lack conventions for linking data sources to a computational result that is stated in the main publication text or used to generate a figure or table.

Thus, here I propose a data citation format that allows for an automatic reproduction of all computations. A data citation consists of a descriptor that refers to the functional program code and the input that generated the result.

The input itself may be a set of other data citations, such that all data transformations, from the original data sources to the final result, are transparently expressed by a directed graph.

Functions can be implemented in a variety of programming languages since data sources are expected to be stored in open and standardized text-based file formats.

A publication is then an online file repository consisting of a Hypertext Markup Language (HTML) document and additional data and code source files, together with a summarization of all data sources, similar to a list of references in a bibliography.