How to share data for collaboration

Authors : Shannon E Ellis, Jeffrey T Leek

Within the statistics community, a number of guiding principles for sharing data have emerged; however, these principles are not always made clear to collaborators generating the data. To bridge this divide, we have established a set of guidelines for sharing data.

In these, we highlight the need to provide raw data to the statistician, the importance of consistent formatting, and the necessity of including all essential experimental information and pre-processing steps carried out to the statistician. With these guidelines we hope to avoid errors and delays in data analysis.

URL : How to share data for collaboration

DOI : https://doi.org/10.7287/peerj.preprints.3139v1

 

Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

Authors : Julie A. McMurry, Nick Juty, Niklas Blomberg, Tony Burdett, Tom Conlin, Nathalie Conte, Mélanie Courtot, John Deck, Michel Dumontier, Donal K. Fellows, Alejandra Gonzalez-Beltran, Philipp Gormanns, Jeffrey Grethe, Janna Hastings, Jean-Karim Hériché, Henning Hermjakob, Jon C. Ison, Rafael C. Jimenez, Simon Jupp, John Kunze, Camille Laibe, Nicolas Le Novère, James Malone, Maria Jesus Martin, Johanna R. McEntyre, Chris Morris, Juha Muilu, Wolfgang Müller, Philippe Rocca-Serra, Susanna-Assunta Sansone, Murat Sariyar, Jacky L. Snoep, Stian Soiland-Reyes, Natalie J. Stanford, Neil Swainston, Nicole Washington, Alan R. Williams, Sarala M. Wimalaratne, Lilly M. Winfree, Katherine Wolstencroft, Carole Goble, Christopher J. Mungall, Melissa A. Haendel, Helen Parkinson

In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure.

Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers.

We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability.

We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.

URL : Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

DOI : https://doi.org/10.1371/journal.pbio.2001414

The legal and policy framework for scientific data sharing, mining and reuse

Author : Mélanie Dulong de Rosnay

Text and Data Mining, the automatic processing of large amounts of scientific articles and datasets, is an essential practice for contemporary researchers. Some publishers are challenging it as a lawful activity and the topic is being discussed during European copyright law reform process.

In order to better understand the underlying debate and contribute to the policy discussion, this article first examines the legal status of data access and reuse and licensing policies. It then presents available options supporting the exercise of Text and Data Mining: publication under open licenses, open access legislations and a recognition of the legitimacy of the activity.

For that purpose, the paper analyses the scientific rational for sharing and its legal and technical challenges and opportunities. In particular, it surveys existing open access and open data legislations and discusses implementation in European and Latin America jurisdictions.

Framing Text and Data mining as an exception to copyright could be problematic as it de facto denies that this activity is part of a positive right to read and should not require additional permission nor licensing.

It is crucial in licenses and legislations to provide a correct definition of what is Open Access, and to address the question of pre-existing copyright agreements. Also, providing implementation means and technical support is key. Otherwise, legislations could remain declarations of good principles if repositories are acting as empty shells.

URL ; https://books.openedition.org/editionsmsh/9082

Scientific data from and for the citizen

Authors : Sven Schade, Chrisa Tsinaraki, Elena Roglia

Powered by advances of technology, today’s Citizen Science projects cover a wide range of thematic areas and are carried out from local to global levels. This wealth of activities creates an abundance of data, for example, in the forms of observations submitted by mobile phones; readings of low-cost sensors; or more general information about peoples’ activities.

The management and possible sharing of this data has become a research topic in its own right. We conducted a survey in the summer of 2015 in order to collectively analyze the state of play in Citizen Science.

This paper summarizes our main findings related to data access, standardization and data preservation. We provide examples of good practices in each of these areas and outline actions to address identified challenges.

URL : http://firstmonday.org/ojs/index.php/fm/article/view/7842

Searching Data: A Review of Observational Data Retrieval Practices

Authors : Kathleen Gregory, Paul Groth, Helena Cousijn, Andrea Scharnhorst, Sally Wyatt

A cross-disciplinary examination of the user behaviours involved in seeking and evaluating data is surprisingly absent from the research data discussion. This review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data.

Two analytical frameworks rooted in information retrieval and science technology studies are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.

URL : https://arxiv.org/abs/1707.06937

 

Journal Data Sharing Policies and Statistical Reporting Inconsistencies in Psychology

Authors : Michele Nuijten, Jeroen Borghuis, Coosje Veldkamp, Linda Alvarez, Marcel van Assen, Jelte Wicherts

In this paper, we present three studies that investigate the relation between data sharing and statistical reporting inconsistencies. Previous research found that reluctance to share data was related to a higher prevalence of statistical errors, often in the direction of statistical significance (Wicherts, Bakker, & Molenaar, 2011).

We therefore hypothesized that journal policies about data sharing and data sharing itself would reduce these inconsistencies. In Study 1, we compared the prevalence of reporting inconsistencies in two similar journals on decision making with different data sharing policies.

In Study 2, we compared reporting inconsistencies in articles published in PLOS (with a data sharing policy) and Frontiers in Psychology (without a data sharing policy). In Study 3, we looked at papers published in the journal Psychological Science to check whether papers with or without an Open Practice Badge differed in the prevalence of reporting errors.

Overall, we found no relationship between data sharing and reporting inconsistencies. We did find that journal policies on data sharing are extremely effective in promoting data sharing.

We argue that open data is essential in improving the quality of psychological science, and we discuss ways to detect and reduce reporting inconsistencies in the literature.

DOI : https://dx.doi.org/10.17605/OSF.IO/SGBTA