The ability to measure the use and impact of published data sets is key to the success of the open data/open science paradigm. A direct measure of impact would require tracking data (re)use in the wild, which is difficult to achieve.
This is therefore commonly replaced by simpler metrics based on data download and citation counts. In this paper we describe a scenario where it is possible to track the trajectory of a dataset after its publication, and show how this enables the design of accurate models for ascribing credit to data originators.
A Data Trajectory (DT) is a graph that encodes knowledge of how, by whom, and in which context data has been re-used, possibly after several generations. We provide a theoretical model of DTs that is grounded in the W3C PROV data model for provenance, and we show how DTs can be used to automatically propagate a fraction of the credit associated with transitively derived datasets, back to original data contributors.
We also show this model of transitive credit in action by means of a Data Reuse Simulator. In the longer term, our ultimate hope is that credit models based on direct measures of data reuse will provide further incentives to data publication.
We conclude by outlining a research agenda to address the hard questions of creating, collecting, and using DTs systematically across a large number of data reuse instances in the wild.
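The idea of propagating a fraction of credit back along a chain of derivations can be illustrated with a minimal sketch. It is not the authors' model: the function name `propagate_credit`, the fixed `fraction`, and the equal split among parent datasets are all assumptions made for illustration. The input mapping plays the role of PROV-style "was derived from" edges between datasets.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def propagate_credit(derived_from, initial_credit, fraction=0.5):
    """Push credit from derived datasets back to the datasets they reuse.

    derived_from   -- maps a dataset to the datasets it was directly derived from
    initial_credit -- credit first assigned to each dataset (e.g. citation counts)
    fraction       -- share of its credit a derived dataset passes upstream
                      (hypothetical parameter, split equally among its parents)
    """
    # static_order() lists parents before the datasets derived from them.
    order = list(TopologicalSorter(derived_from).static_order())
    held = {d: float(initial_credit.get(d, 0.0)) for d in order}

    # Walk from the most-derived datasets back towards the original sources,
    # so each dataset has received all incoming credit before passing it on.
    for dataset in reversed(order):
        parents = derived_from.get(dataset, [])
        if not parents:
            continue  # an original contribution keeps everything it holds
        passed = fraction * held[dataset]
        held[dataset] -= passed
        for parent in parents:
            held[parent] += passed / len(parents)
    return held


# Hypothetical trajectory: B reuses A, and C reuses both A and B.
edges = {"B": ["A"], "C": ["A", "B"]}
credit = propagate_credit(edges, {"C": 8})
print(credit)
```

In this toy trajectory only C is cited directly, yet A ends up with credit through two routes: directly from C and indirectly via B. The total amount of credit is conserved; only its distribution changes.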
Authors: Stefan Buddenbohm, Nathanael Cretin, Elly Dijk, Bertrand Gaie, Maaike De Jong, Jean-Luc Minel, Blandine Nouvel
Publishing research data as open data is not yet common practice for researchers in the arts and humanities, and lags behind other scientific fields, such as the natural sciences. Moreover, even when humanities researchers publish their data in repositories and archives, these data are often hard to find and use by other researchers in the field.
The goal of Work Package 7 of the HaS (Humanities at Scale) DARIAH project is to develop an open data platform for the humanities. Work in Task 7.1 is a joint effort of Data Archiving and Networked Services (DANS), Centre National de la Recherche Scientifique (CNRS) and the University of Göttingen – State and University Library (UGOE-SUB).
This report gives an overview of the various aspects that are connected to open access publishing of research data in the humanities. After the introduction, where we give definitions of key concepts, we describe the research data life cycle.
We present an overview of the different stakeholders involved and we look into advantages and obstacles for researchers to share research data. Furthermore, a description of the European data repositories is given, followed by certification standards of trusted digital data repositories.
The possibility of data citation is important for sharing open data and is also described in this report. We also discuss the standards and use of metadata in the humanities. Finally, we discuss a best-practice example of an open access research data system in the humanities: the French open research data ecosystem.
With this report we provide information and guidance on open access publishing of humanities research data for researchers. The report is the result of a desk study of the current state of open access research data and the specific challenges for the humanities. It will serve as input for Task 7.2, which will deliver a design and sustainability plan for an open humanities data platform, and for Task 7.3, which will deliver this platform.
Reproducibility and reusability of research results are important concerns in scientific communication and science policy. A foundational element of reproducibility and reusability is the open and persistently available presentation of research data.
However, many common approaches for primary data publication in use today do not achieve sufficient long-term robustness, openness, accessibility or uniformity. Nor do they permit comprehensive exploitation by modern Web technologies.
This has led to several authoritative studies recommending uniform direct citation of data archived in persistent repositories. Data are to be considered as first-class scholarly objects, and treated similarly in many ways to cited and archived scientific and scholarly literature.
Here we briefly review the most current and widely agreed set of principle-based recommendations for scholarly data citation, the Joint Declaration of Data Citation Principles (JDDCP).
We then present a framework for operationalizing the JDDCP, and a set of initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data.
The main target audience for the common implementation guidelines in this article consists of publishers, scholarly organizations, and persistent data repositories, including technical staff members in these organizations.
But ordinary researchers can also benefit from these recommendations. The guidance provided here is intended to help achieve widespread, uniform human and machine accessibility of deposited data, in support of significantly improved verification, validation, reproducibility and re-use of scholarly/scientific data.
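To make the notion of required metadata elements and resolvable identifiers concrete, here is a minimal sketch that assembles a human-readable data citation from a small metadata record. The field set, the class and function names, and the output layout are assumptions chosen for illustration; the JDDCP does not prescribe a citation style, and the DOI shown is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetMetadata:
    """Hypothetical minimal record for a cited dataset."""
    creators: List[str]
    year: int
    title: str
    repository: str
    identifier: str            # persistent identifier, e.g. a DOI
    version: Optional[str] = None


def resolvable(identifier: str) -> str:
    """Express a bare DOI as an HTTPS URL via the doi.org resolver."""
    if identifier.startswith("10."):
        return "https://doi.org/" + identifier
    return identifier  # assume other identifiers are already resolvable


def format_citation(meta: DatasetMetadata) -> str:
    """Build a citation string, refusing records that lack core elements."""
    required = ("creators", "title", "repository", "identifier")
    missing = [name for name in required if not getattr(meta, name)]
    if missing:
        raise ValueError("missing metadata: " + ", ".join(missing))

    parts = ["; ".join(meta.creators) + f" ({meta.year})", meta.title]
    if meta.version:
        parts.append(f"Version {meta.version}")
    parts.append(meta.repository)
    return ". ".join(parts) + ". " + resolvable(meta.identifier)


# Placeholder values only; this is not a real dataset or DOI.
record = DatasetMetadata(
    creators=["Doe, J.", "Roe, R."],
    year=2015,
    title="Example Survey Data",
    repository="Example Repository",
    identifier="10.1234/example.5678",
    version="2",
)
citation = format_citation(record)
print(citation)
```

Rejecting incomplete records early reflects the point that a citation is only machine-actionable when the identifier and core descriptive elements are actually present.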
Growing Institutional Support for Data Citation: Results of a Partnership Between Griffith University and the Australian National Data Service:
« Data is increasingly recognised as a valuable product of research and a number of international initiatives are underway to ensure it is better managed, connected, published, discovered, cited and reused. Within this context, data citation is an emergent practice rather than a norm of scholarly attribution. In 2012, a data citation project at Griffith University funded by the Australian National Data Service (ANDS) commenced that aimed to: enhance existing infrastructure for data citation at the University; test methodologies for tracking impact; and provide targeted outreach to researchers about the benefits of data citation. The project extended previous collaboration between Griffith and ANDS that built infrastructure at the University to assign DOI names (Digital Object Identifiers) to research data produced by Griffith’s researchers. This article reports on the findings of the project and provides a case study of what can be achieved at the institutional level to support data citation. »
Making Data a First Class Scientific Output: Data Citation and Publication by NERC’s Environmental Data Centres:
« The NERC Science Information Strategy Data Citation and Publication project aims to develop and formalise a method for formally citing and publishing the datasets stored in its environmental data centres. It is believed that this will act as an incentive for scientists, who often invest a great deal of effort in creating datasets, to submit their data to a suitable data repository where it can properly be archived and curated. Data citation and publication will also provide a mechanism for data producers to receive credit for their work, thereby encouraging them to share their data more freely. »
« Data citation should be a necessary corollary of data publication and reuse. Many researchers are reluctant to share their data, yet they are increasingly encouraged to do just that.
Reward structures must be in place to encourage data publication, and citation is the appropriate tool for scholarly acknowledgment. Data citation also allows for the identification, retrieval, replication, and verification of data underlying published studies.
This study examines author behavior and sources of instruction in disciplinary and cultural norms for writing style and citation via a content analysis of journal articles, author instructions, style manuals, and data publishers. Instances of data citation are benchmarked against a Data Citation Adequacy Index.
Roughly half of journals point toward a style manual that addresses data citation, but the majority of journal articles failed to include an adequate citation to data used in secondary analysis studies.
Full citation of data is not currently a normative behavior in scholarly writing. Multiplicity of data types and lack of awareness regarding existing standards contribute to the problem.
Citations for data must be promoted as an essential component of data publication, sharing, and reuse. Despite confounding factors, librarians and information professionals are well-positioned and should persist in advancing data citation as a normative practice across domains.
Doing so promotes a value proposition for data sharing and secondary research broadly, thereby accelerating the pace of scientific research. »