Theory and Practice of Data Citation

Author : Gianmaria Silvello

Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming « data-intensive », where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets.

Yet, given a dataset, there is no quantitative, consistent and established way of knowing how it has been used over time, who contributed to its curation, what results have been yielded or what value it has.

The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality as traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining the principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation.

The current panorama is many-faceted and an overall view that brings together diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, both from the theoretical (the why and what) and the practical (the how) angle.

URL : https://arxiv.org/abs/1706.07976

Experiences in integrated data and research object publishing using GigaDB

Authors : Scott C Edmunds, Peter Li, Christopher I Hunter, Si Zhe Xiao, Robert L Davidson, Nicole Nogoy, Laurie Goodman

In the era of computation and data-driven research, traditional methods of disseminating research are no longer fit-for-purpose. New approaches for disseminating data, methods and results are required to maximize knowledge discovery.

The “long tail” of small, unstructured datasets is well catered for by a number of general-purpose repositories, but there has been less support for “big data”. Outlined here are our experiences in attempting to tackle the gaps in publishing large-scale, computationally intensive research.

GigaScience is an open-access, open-data journal aiming to revolutionize large-scale biological data dissemination, organization and re-use. Through use of the data handling infrastructure of the genomics centre BGI, GigaScience links standard manuscript publication with an integrated database (GigaDB) that hosts all associated data, and provides additional data analysis tools and computing resources.

Furthermore, the supporting workflows and methods are also integrated to make published articles more transparent and open. GigaDB has released many new and previously unpublished datasets and data types, including urgently needed data to tackle infectious disease outbreaks, cancer and the growing food crisis.

Other “executable” research objects, such as workflows, virtual machines and software from several GigaScience articles have been archived and shared in reproducible, transparent and usable formats.

With data citation producing evidence of, and credit for, its use in the wider research community, GigaScience demonstrates a move towards more executable publications. Here data analyses can be reproduced and built upon by users without coding backgrounds or heavy computational infrastructure in a more democratized manner.

URL : http://link.springer.com/article/10.1007/s00799-016-0174-6

Evaluating and Promoting Open Data Practices in Open Access Journals

Authors : Eleni Castro, Mercè Crosas, Alex Garnett, Kasey Sheridan, Micah Altman

In the last decade there has been a dramatic increase in attention from the scholarly communications and research community to open access (OA) and open data practices.

These are potentially related, because journal publication policies and practices both signal disciplinary norms, and provide direct incentives for data sharing and citation. However, there is little research evaluating the data policies of OA journals.

In this study, we analyze the state of data policies in open access journals by randomly sampling the Directory of Open Access Journals (DOAJ) and Open Journal Systems (OJS) journal directories, and by applying a coding framework that integrates both previous studies and emerging taxonomies of data sharing and citation.

For the first time, this study reveals the low prevalence of data sharing policies and practices in OA journals, a result that differs from previous studies of commercial journals in specific disciplines.

URL : Evaluating and Promoting Open Data Practices in Open Access Journals

Data trajectories: tracking reuse of published data for transitive credit attribution

Author : Paolo Missier

The ability to measure the use and impact of published data sets is key to the success of the open data/open science paradigm. A direct measure of impact would require tracking data (re)use in the wild, which is difficult to achieve.

This is therefore commonly replaced by simpler metrics based on data download and citation counts. In this paper we describe a scenario where it is possible to track the trajectory of a dataset after its publication, and show how this enables the design of accurate models for ascribing credit to data originators.

A Data Trajectory (DT) is a graph that encodes knowledge of how, by whom, and in which context data has been re-used, possibly after several generations. We provide a theoretical model of DTs that is grounded in the W3C PROV data model for provenance, and we show how DTs can be used to automatically propagate a fraction of the credit associated with transitively derived datasets, back to original data contributors.
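A minimal sketch of how such transitive credit propagation might work. The per-generation decay factor and the plain-dictionary encoding of derivation edges are illustrative assumptions, not the paper's actual model, which is grounded in the W3C PROV data model and may weight credit differently:

```python
# Illustrative sketch of transitive credit propagation over a derivation
# graph. `derived_from` maps each dataset to the datasets it was derived
# from; `credit` holds the credit initially earned by each dataset.

def propagate_credit(derived_from, credit, alpha=0.5):
    """Ascribe a fraction `alpha` of each dataset's credit to its
    ancestors, attenuating the share at every generation. Assumes the
    derivation graph is acyclic (as a provenance graph should be)."""
    totals = dict(credit)

    def ascribe(node, amount):
        parents = derived_from.get(node, [])
        for parent in parents:
            # Split the propagated fraction evenly among direct parents.
            share = amount * alpha / len(parents)
            totals[parent] = totals.get(parent, 0.0) + share
            ascribe(parent, share)  # continue across earlier generations

    for node, earned in credit.items():
        ascribe(node, earned)
    return totals

# D3 was derived from D2, which was derived from D1; only D3 is cited.
shares = propagate_credit({"D2": ["D1"], "D3": ["D2"]}, {"D3": 1.0})
```

With `alpha=0.5`, the originator `D1` receives a quarter of the credit earned by its second-generation derivative, illustrating how credit flows back "possibly after several generations" as the abstract describes.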

We also show this model of transitive credit in action by means of a Data Reuse Simulator. In the longer term, our ultimate hope is that credit models based on direct measures of data reuse will provide further incentives for data publication.

We conclude by outlining a research agenda to address the hard questions of creating, collecting, and using DTs systematically across a large number of data reuse instances in the wild.

DOI : http://dx.doi.org/10.2218/ijdc.v11i1.425

State of the art report on open access publishing of research data in the humanities

Authors : Stefan Buddenbohm, Nathanael Cretin, Elly Dijk, Bertrand Gaiffe, Maaike De Jong, Jean-Luc Minel, Blandine Nouvel

Publishing research data as open data is not yet common practice for researchers in the arts and humanities, and lags behind other scientific fields, such as the natural sciences. Moreover, even when humanities researchers publish their data in repositories and archives, these data are often hard to find and use by other researchers in the field.

The goal of Work Package 7 of the HaS (Humanities at Scale) DARIAH project is to develop an open data platform for the humanities. Work in task 7.1 is a joint effort of Data Archiving and Networked Services (DANS), the Centre National de la Recherche Scientifique (CNRS) and the University of Göttingen – State and University Library (UGOE-SUB).

This report gives an overview of the various aspects that are connected to open access publishing of research data in the humanities. After the introduction, where we give definitions of key concepts, we describe the research data life cycle.

We present an overview of the different stakeholders involved and we look into advantages and obstacles for researchers to share research data. Furthermore, a description of the European data repositories is given, followed by certification standards of trusted digital data repositories.

The possibility of data citation is important for sharing open data and is also described in this report. We also discuss the standards and use of metadata in the humanities. Finally, we discuss a best practice example of an open access research data system in the humanities: the French open research data ecosystem.

With this report we provide information and guidance on open access publishing of humanities research data for researchers. The report is the result of a desk study of the current state of open access research data and the specific challenges for the humanities. It will serve as input for Task 7.2, which will deliver a design and sustainability plan for an open humanities data platform, and for Task 7.3, which will deliver this platform.

URL : https://halshs.archives-ouvertes.fr/halshs-01357208

Achieving human and machine accessibility of cited data in scholarly publications

Reproducibility and reusability of research results is an important concern in scientific communication and science policy. A foundational element of reproducibility and reusability is the open and persistently available presentation of research data.

However, many common approaches for primary data publication in use today do not achieve sufficient long-term robustness, openness, accessibility or uniformity. Nor do they permit comprehensive exploitation by modern Web technologies.

This has led to several authoritative studies recommending uniform direct citation of data archived in persistent repositories. Data are to be considered as first-class scholarly objects, and treated similarly in many ways to cited and archived scientific and scholarly literature.

Here we briefly review the most current and widely agreed set of principle-based recommendations for scholarly data citation, the Joint Declaration of Data Citation Principles (JDDCP).

We then present a framework for operationalizing the JDDCP; and a set of initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data.
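One concrete mechanism behind "programmatic machine actionability" of a cited identifier is HTTP content negotiation against the DOI resolver: the same identifier resolves to a human-readable landing page or to machine-readable metadata depending on the requested media type. The sketch below builds such a request; the choice of the CSL JSON media type is an assumption about what the target resolver supports (the DataCite and Crossref resolver infrastructure generally does):

```python
from urllib.request import Request

def metadata_request(doi):
    """Build an HTTP request asking the DOI resolver for machine-readable
    metadata (CSL JSON) instead of the human-oriented landing page."""
    return Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )

# Example using the DOI of this very article.
req = metadata_request("10.7717/peerj-cs.1")
```

Sending the request with `urllib.request.urlopen(req)` would return structured citation metadata where supported; without the `Accept` header, the same URL redirects to the landing page intended for human readers.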

The main target audience for the common implementation guidelines in this article consists of publishers, scholarly organizations, and persistent data repositories, including technical staff members in these organizations.

But ordinary researchers can also benefit from these recommendations. The guidance provided here is intended to help achieve widespread, uniform human and machine accessibility of deposited data, in support of significantly improved verification, validation, reproducibility and re-use of scholarly/scientific data.

DOI : https://doi.org/10.7717/peerj-cs.1

Growing Institutional Support for Data Citation : Results of a Partnership Between Griffith University and the Australian National Data Service

« Data is increasingly recognised as a valuable product of research and a number of international initiatives are underway to ensure it is better managed, connected, published, discovered, cited and reused. Within this context, data citation is an emergent practice rather than a norm of scholarly attribution. In 2012, a data citation project at Griffith University funded by the Australian National Data Service (ANDS) commenced that aimed to: enhance existing infrastructure for data citation at the University; test methodologies for tracking impact; and provide targeted outreach to researchers about the benefits of data citation. The project extended previous collaboration between Griffith and ANDS that built infrastructure at the University to assign DOI names (Digital Object Identifiers) to research data produced by Griffith’s researchers. This article reports on the findings of the project and provides a case study of what can be achieved at the institutional level to support data citation. »

URL : http://www.dlib.org/dlib/november13/simons/11simons.html