Measuring and Mapping Data Reuse: Findings From an Interactive Workshop on Data Citation and Metrics for Data Reuse

Author : Lisa Federer

Widely adopted standards for data citation are foundational to efforts to track and quantify data reuse. Without the means to track data reuse and metrics to measure its impact, it is difficult to reward researchers who share high-value data with meaningful credit for their contribution.

Despite initial work on developing guidelines for data citation and metrics, standards have not yet been universally adopted. This article reports on the recommendations collected from a workshop held at the Future of Research Communications and e-Scholarship (FORCE11) 2018 meeting titled Measuring and Mapping Data Reuse: An Interactive Workshop on Metrics for Data.

A range of stakeholders were represented among the participants, including publishers, researchers, funders, repository administrators, librarians, and others.

Collectively, they generated a set of 68 recommendations for specific actions that could be taken by standards and metrics creators; publishers; repositories; funders and institutions; creators of reference management software and citation styles; and researchers, students, and librarians.

These specific, concrete, and actionable recommendations would help facilitate broader adoption of standard citation mechanisms and easier measurement of data reuse.

URL : Measuring and Mapping Data Reuse: Findings From an Interactive Workshop on Data Citation and Metrics for Data Reuse


The History and Future of Data Citation in Practice

Authors : Mark A. Parsons, Ruth E. Duerr, Matthew B. Jones

In this review, we adopt the definition that ‘Data citation is a reference to data for the purpose of credit attribution and facilitation of access to the data’ (TGDCSP 2013: CIDCR6). Furthermore, access should be enabled for both humans and machines (DCSG 2014).

We use this to discuss how data citation has evolved over the last couple of decades and to highlight issues that need more research and attention.

Data citation is not a new concept, but it has changed and evolved considerably since the beginning of the digital age. Basic practice is now established and slowly but increasingly being implemented.

Nonetheless, critical issues remain. These issues arise primarily because we try to address multiple human and computational concerns with a system originally designed in a non-digital world for more limited use cases.

The community is beginning to challenge past assumptions, separate the multiple concerns (credit, access, reference, provenance, impact, etc.), and apply different approaches for different use cases.

URL : The History and Future of Data Citation in Practice


Reproducible data citations for computational research

Author : Christian Schulz

The general purpose of a scientific publication is the exchange and spread of knowledge. A publication usually reports a scientific result and tries to convince the reader that it is valid.

With an ever-growing number of papers relying on computational methods that make use of large quantities of data and sophisticated statistical modeling techniques, a textual description of the result is often not enough for a publication to be transparent and reproducible.

While there are efforts to encourage sharing of code and data, we currently lack conventions for linking data sources to a computational result that is stated in the main publication text or used to generate a figure or table.

Thus, here I propose a data citation format that allows for an automatic reproduction of all computations. A data citation consists of a descriptor that refers to the functional program code and the input that generated the result.

The input itself may be a set of other data citations, such that all data transformations, from the original data sources to the final result, are transparently expressed by a directed graph.

Functions can be implemented in a variety of programming languages since data sources are expected to be stored in open and standardized text-based file formats.

A publication is then an online file repository consisting of a Hypertext Markup Language (HTML) document and additional data and code source files, together with a summarization of all data sources, similar to a list of references in a bibliography.
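The descriptor-and-inputs scheme above can be sketched in a few lines of Python. This is an illustrative assumption, not the author's actual format: the `DataCitation` class, its field names, and the `resolve` method are hypothetical, but they show the core idea that a citation pairs program code with inputs, and that inputs may themselves be citations, so reproducing a result walks the directed graph of transformations back to the original data sources.

```python
from dataclasses import dataclass
from typing import Callable, Union

# A source is either a raw input (e.g. a file path) or another citation.
# All names here are illustrative, not part of the proposed standard.
Source = Union[str, "DataCitation"]

@dataclass
class DataCitation:
    """Descriptor linking a result to the code and inputs that produced it."""
    function: Callable   # the transformation that generated the result
    inputs: tuple = ()   # raw sources or upstream DataCitations

    def resolve(self):
        # Recursively reproduce all upstream results, then apply this
        # step's code: an automatic re-computation of the cited result.
        args = [s.resolve() if isinstance(s, DataCitation) else s
                for s in self.inputs]
        return self.function(*args)

# Toy chain: a "load" step feeding a downstream transformation.
load = DataCitation(lambda path: len(path), inputs=("data/raw.csv",))
total = DataCitation(lambda a, b: a + b, inputs=(load, 3))
print(total.resolve())  # re-runs the whole chain: 15
```

Because each node records only a function and its inputs, the same structure works regardless of the implementation language of each step, matching the paper's point that functions may be written in different languages as long as intermediate data use open, text-based formats.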


A Data Citation Roadmap for Scholarly Data Repositories

Authors : Martin Fenner, Mercè Crosas, Jeffrey S. Grethe, David Kennedy, Henning Hermjakob, Phillippe Rocca-Serra, Gustavo Durand, Robin Berjon, Sebastian Karcher, Maryann Martone, Tim Clark

This article presents a practical roadmap for scholarly data repositories to implement data citation in accordance with the Joint Declaration of Data Citation Principles, a synopsis and harmonization of the recommendations of major science policy bodies.

The roadmap was developed by the Repositories Expert Group, as part of the Data Citation Implementation Pilot (DCIP) project, an initiative of FORCE11 and the NIH BioCADDIE program.

The roadmap makes 11 specific recommendations, grouped into three phases of implementation: a) required steps needed to support the Joint Declaration of Data Citation Principles, b) recommended steps that facilitate article/data publication workflows, and c) optional steps that further improve data citation support provided by data repositories.

URL : A Data Citation Roadmap for Scholarly Data Repositories



Theory and Practice of Data Citation

Author : Gianmaria Silvello

Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming “data-intensive”, where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets.

Yet, given a dataset, there is no quantitative, consistent and established way of knowing how it has been used over time, who contributed to its curation, what results have been yielded or what value it has.

The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality of traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining the principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation.

The current panorama is many-faceted and an overall view that brings together diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, both from the theoretical (the why and what) and the practical (the how) angle.


Experiences in integrated data and research object publishing using GigaDB

Authors : Scott C Edmunds, Peter Li, Christopher I Hunter, Si Zhe Xiao, Robert L Davidson, Nicole Nogoy, Laurie Goodman

In the era of computation and data-driven research, traditional methods of disseminating research are no longer fit-for-purpose. New approaches for disseminating data, methods and results are required to maximize knowledge discovery.

The “long tail” of small, unstructured datasets is well catered for by a number of general-purpose repositories, but there has been less support for “big data”. Outlined here are our experiences in attempting to tackle the gaps in publishing large-scale, computationally intensive research.

GigaScience is an open-access, open-data journal aiming to revolutionize large-scale biological data dissemination, organization and re-use. Through use of the data handling infrastructure of the genomics centre BGI, GigaScience links standard manuscript publication with an integrated database (GigaDB) that hosts all associated data, and provides additional data analysis tools and computing resources.

Furthermore, the supporting workflows and methods are also integrated to make published articles more transparent and open. GigaDB has released many new and previously unpublished datasets and data types, including urgently needed data to tackle infectious disease outbreaks, cancer and the growing food crisis.

Other “executable” research objects, such as workflows, virtual machines and software from several GigaScience articles have been archived and shared in reproducible, transparent and usable formats.

With data citation producing evidence of, and credit for, its use in the wider research community, GigaScience demonstrates a move towards more executable publications, in which data analyses can be reproduced and built upon by users without coding backgrounds or heavy computational infrastructure, in a more democratized manner.

URL : Experiences in integrated data and research object publishing using GigaDB