What do data curators care about? Data quality, user trust, and the data reuse plan

Author : Frank Andreas Sposito

Data curation is often defined as the practice of maintaining, preserving, and enhancing research data for long-term value and reusability. The role of data reuse in the data curation lifecycle is critical: increased reuse is the core justification for the often sizable expenditures necessary to build data management infrastructures and user services.

Yet recent studies have shown that data are being shared and reused through open data repositories at much lower levels than expected. These studies underscore a fundamental and often overlooked challenge in research data management that invites deeper examination of the roles and responsibilities of data curators.

This presentation will identify two key barriers to data reuse (data quality and user trust) and propose a framework for implementing reuser-centric strategies to increase data reuse.

Using the concept of a "data reuse plan," it will highlight repository-based approaches to improving data quality and user trust, and address critical areas of innovation for data curators working in the absence of repository support.

URL : What do data curators care about? Data quality, user trust, and the data reuse plan

Alternative location : http://library.ifla.org/id/eprint/1797

 

Understanding Perspectives on Sharing Neutron Data at Oak Ridge National Laboratory

Authors : Devan Ray Donaldson, Shawn Martin, Thomas Proffen

Even though the importance of sharing data is frequently discussed, data sharing appears to be limited to a few fields, and practices within those fields are not well understood. This study examines perspectives on sharing neutron data collected at Oak Ridge National Laboratory’s neutron sources.

Operations at user facilities have traditionally focused on making data accessible to those who create them. The recent emphasis on open data is shifting the focus toward ensuring that the data produced are reusable by others.

This mixed methods research study included a series of surveys and focus group interviews in which 13 data consumers, data managers, and data producers answered questions about their perspectives on sharing neutron data.

Data consumers reported interest in reusing neutron data for comparison/verification of results against their own measurements and testing new theories using existing data. They also stressed the importance of establishing context for data, including how data are produced, how samples are prepared, units of measurement, and how temperatures are determined.

Data managers expressed reservations about reusing others’ data because they were not always sure if they could trust whether the people responsible for interpreting data did so correctly.

Data producers described concerns about their data being misused, competing with other users, and over-reliance on data producers to understand data. We present the Consumers Managers Producers (CMP) Model for understanding the interplay of each group regarding data sharing.

We conclude with policy and system recommendations and discuss directions for future research.

URL : Understanding Perspectives on Sharing Neutron Data at Oak Ridge National Laboratory

DOI : http://doi.org/10.5334/dsj-2017-035

On the Reuse of Scientific Data

Authors : Irene V. Pasquetto, Bernadette M. Randles, Christine L. Borgman

While science policy promotes data sharing and open data, these are not ends in themselves. Among the arguments for data sharing are reproducing research, making public assets available to the public, leveraging investments in research, and advancing research and innovation.

To achieve these expected benefits of data sharing, data must actually be reused by others. Data sharing practices, especially motivations and incentives, have received far more study than has data reuse, perhaps because of the array of contested concepts on which reuse rests and the disparate contexts in which it occurs.

Here we explicate concepts of data, sharing, and open data as a means to examine data reuse. We explore distinctions between use and reuse of data.

Lastly we propose six research questions on data reuse worthy of pursuit by the community: How can uses of data be distinguished from reuses? When is reproducibility an essential goal? When is data integration an essential goal? What are the tradeoffs between collecting new data and reusing existing data? How do motivations for data collection influence the ability to reuse data? How do standards and formats for data release influence reuse opportunities?

We conclude by summarizing the implications of these questions for science policy and for investments in data reuse.

URL : On the Reuse of Scientific Data

DOI : http://doi.org/10.5334/dsj-2017-008

Data Reuse as a Prisoner’s Dilemma: the social capital of open science

Author : Bradly Alicea

Participation in Open Data initiatives requires two semi-independent actions: the sharing of data produced by a researcher or group, and the consumption of shared data. Consumers of shared data range from people interested in validating the results of a given study to transformers of the data.

These transformers can add value to the dataset by extracting new relationships and information. The relationship between producers and consumers can be modeled in a game-theoretic context, namely by using a Prisoner's Dilemma (PD) model to better understand the potential barriers to and benefits of sharing.

In this paper, we will introduce the problem of data sharing, consider assumptions about economic versus social payoffs, and provide simplistic payoff matrices of data sharing.

Several variations on the payoff matrix are given for different institutional scenarios, ranging from the ubiquitous acceptance of Open Science principles to a context where the standard is entirely non-cooperative. Implications for building a CC-BY economy are then discussed in context.
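The PD framing above can be illustrated with a toy payoff matrix. This is a minimal sketch, not the paper's actual model: the numeric payoffs are hypothetical values chosen only to satisfy the standard PD ordering (temptation > mutual-cooperation reward > mutual-defection punishment > sucker's payoff), with "share" as cooperation and "withhold" as defection.

```python
# Illustrative payoff matrix for data sharing framed as a Prisoner's Dilemma.
# All numeric values are hypothetical; they merely satisfy the PD ordering
# T > R > P > S for each player.
payoffs = {
    # (row strategy, column strategy): (row payoff, column payoff)
    ("share", "share"): (3, 3),        # R: mutual reuse benefits both
    ("share", "withhold"): (0, 5),     # S, T: sharer bears cost, free-rider gains
    ("withhold", "share"): (5, 0),     # T, S
    ("withhold", "withhold"): (1, 1),  # P: no one reuses anything
}

def best_response(opponent_strategy, player=0):
    """Return the strategy maximizing a player's payoff given the opponent's choice."""
    strategies = ("share", "withhold")
    def payoff(mine):
        key = (mine, opponent_strategy) if player == 0 else (opponent_strategy, mine)
        return payoffs[key][player]
    return max(strategies, key=payoff)

# With these payoffs, withholding dominates sharing for both players, so
# (withhold, withhold) is the unique Nash equilibrium: the social dilemma
# that the paper's institutional scenarios attempt to escape.
for opp in ("share", "withhold"):
    print(f"best response to {opp}: {best_response(opp)}")
```

Changing the institutional context (e.g., rewarding sharing so that the mutual-sharing payoff exceeds the temptation payoff) alters the matrix and can flip the equilibrium, which is the kind of variation the paper explores.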

URL : Data Reuse as a Prisoner’s Dilemma: the social capital of open science

DOI : https://doi.org/10.1101/093518

Research Data Reusability: Conceptual Foundations, Barriers and Enabling Technologies

Author : Costantino Thanos

High-throughput scientific instruments are generating massive amounts of data. Today, one of the main challenges faced by researchers is to make the best use of the world’s growing wealth of data. Data (re)usability is becoming a distinct characteristic of modern scientific practice.

By data (re)usability, we mean the ease with which data produced by one or more research communities (producer communities) can be used for legitimate scientific research by other research communities (consumer communities).

Data (re)usability enables the reanalysis of evidence, the reproduction and verification of results, reduced duplication of effort, and building on the work of others. It has four main dimensions: policy, legal, economic, and technological. The paper addresses the technological dimension of data reusability.

The conceptual foundations of data reuse as well as the barriers that hamper data reuse are presented and discussed. The data publication process is proposed as a bridge between the data author and user and the relevant technologies enabling this process are presented.

URL : Research Data Reusability: Conceptual Foundations, Barriers and Enabling Technologies

DOI : http://dx.doi.org/10.3390/publications5010002

Reproducible and reusable research: Are journal data sharing policies meeting the mark?

Authors : Nicole A Vasilevsky, Jessica Minnier, Melissa A Haendel, Robin E Champieux

Background

There is wide agreement in the biomedical research community that research data sharing is a primary ingredient for ensuring that science is more transparent and reproducible.

Publishers could play an important role in facilitating and enforcing data sharing; however, many journals have not yet implemented data sharing policies and the requirements vary widely across journals. This study set out to analyze the pervasiveness and quality of data sharing policies in the biomedical literature.

Methods

The online author instructions and editorial policies of 318 biomedical journals were manually reviewed to analyze each journal's data sharing requirements and characteristics.

The data sharing policies were ranked using a rubric to determine if data sharing was required, recommended, required only for omics data, or not addressed at all. The data sharing method and licensing recommendations were examined, as well as any mention of reproducibility or similar concepts.

The data was analyzed for patterns relating to publishing volume, Journal Impact Factor, and the publishing model (open access or subscription) of each journal.

Results

11.9% of journals analyzed explicitly stated that data sharing was required as a condition of publication. 9.1% of journals required data sharing, but did not state that it would affect publication decisions. 23.3% of journals had a statement encouraging authors to share their data but did not require it.

There was no mention of data sharing in 31.8% of journals. Impact factors were significantly higher for journals with the strongest data sharing policies compared to all other data sharing mark categories. Open access journals were not more likely to require data sharing than subscription journals.
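The reported category shares can be tallied directly. The figures below are copied from the results above; the residual share is an inference, covering categories not broken out here, such as policies requiring sharing only for omics data.

```python
# Journal data sharing policy categories and the shares reported above (percent).
reported = {
    "required, condition of publication": 11.9,
    "required, no stated publication consequence": 9.1,
    "encouraged but not required": 23.3,
    "not addressed": 31.8,
}

accounted = round(sum(reported.values()), 1)
# The remaining share of the 318 journals falls into categories not listed
# above (e.g., sharing required only for omics data).
remainder = round(100.0 - accounted, 1)
print(accounted, remainder)
```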

Discussion

Our study confirmed earlier investigations observing that only a minority of biomedical journals require data sharing, and likewise found a significant association between higher Impact Factors and journals with a data sharing requirement.

Moreover, while 65.7% of the journals in our study that required data sharing addressed the concept of reproducibility, we found, as in earlier investigations, that most data sharing policies did not provide specific guidance on the practices that ensure data are maximally available and reusable.

URL : Reproducible and reusable research: Are journal data sharing policies meeting the mark?

DOI : https://doi.org/10.7717/peerj.3208

 

Data trajectories: tracking reuse of published data for transitive credit attribution

Author : Paolo Missier

The ability to measure the use and impact of published data sets is key to the success of the open data/open science paradigm. A direct measure of impact would require tracking data (re)use in the wild, which is difficult to achieve.

This is therefore commonly replaced by simpler metrics based on data download and citation counts. In this paper we describe a scenario where it is possible to track the trajectory of a dataset after its publication, and show how this enables the design of accurate models for ascribing credit to data originators.

A Data Trajectory (DT) is a graph that encodes knowledge of how, by whom, and in which context data has been re-used, possibly after several generations. We provide a theoretical model of DTs that is grounded in the W3C PROV data model for provenance, and we show how DTs can be used to automatically propagate a fraction of the credit associated with transitively derived datasets, back to original data contributors.
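To make the credit-propagation idea concrete, here is a minimal sketch over a toy derivation chain. The graph shape, the fixed propagation fraction, and the credit values are all hypothetical assumptions for illustration; the paper's actual model is grounded in the W3C PROV data model and explored through its Data Reuse Simulator.

```python
# Minimal sketch of transitive credit propagation over a data-derivation graph,
# in the spirit of Data Trajectories. All values here are hypothetical.

FRACTION = 0.5  # assumed share of credit passed back per derivation step

# edges: derived dataset -> datasets it was derived from
derived_from = {
    "D1": [],      # original contribution
    "D2": ["D1"],  # first-generation reuse
    "D3": ["D2"],  # second-generation reuse
}

def propagate(initial_credit):
    """Propagate a fraction of each dataset's credit back to its ancestors."""
    credit = dict(initial_credit)
    # process descendants before ancestors (D3, D2, D1 in this toy chain)
    for ds in sorted(derived_from, reverse=True):
        parents = derived_from[ds]
        if not parents:
            continue
        passed = credit[ds] * FRACTION
        credit[ds] -= passed
        for p in parents:
            credit[p] += passed / len(parents)
    return credit

# Credit earned by the second-generation dataset D3 flows partly to D2,
# and part of D2's augmented credit flows on to the original contributor D1.
print(propagate({"D1": 0.0, "D2": 0.0, "D3": 8.0}))
```

With these assumed values, 8 units of credit earned by D3 leave 2 units with the original contributor D1, showing how transitively derived datasets can reward originators several generations back.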

We also show this model of transitive credit in action by means of a Data Reuse Simulator. In the longer term, our ultimate hope is that credit models based on direct measures of data reuse will provide further incentives to data publication.

We conclude by outlining a research agenda to address the hard questions of creating, collecting, and using DTs systematically across a large number of data reuse instances in the wild.

URL : Data trajectories: tracking reuse of published data for transitive credit attribution

DOI : http://dx.doi.org/10.2218/ijdc.v11i1.425