Data Sustainability and Reuse Pathways of Natural Resources and Environmental Scientists

Author : Yi Shen

This paper presents a multifarious examination of natural resources and environmental scientists’ adventures navigating the policy change towards open access and cultural shift in data management, sharing, and reuse.

At Virginia Tech, a focus group and multiple individual interviews were conducted to explore the domain scientists’ experiences, performances, and perspectives on their collection, adoption, integration, preservation, and management of data.

The results reveal the scientists’ struggles, concerns, and barriers encountered, as well as their shared values, beliefs, passions, and aspirations when working with data. Based on these findings, this study provides suggestions on data modeling and knowledge representation strategies to support the long-term viability, stewardship, accessibility, and sustainability of scientific data.

It also discusses the art of curation as creative scholarship and new opportunities for data librarians and information professionals to mobilize the data revolution.


Data sharing and reanalysis of randomized controlled trials in leading biomedical journals with a full data sharing policy: survey of studies published in The BMJ and PLOS Medicine

Authors : Florian Naudet, Charlotte Sakarovitch, Perrine Janiaud, Ioana Cristea, Daniele Fanelli, David Moher, John P A Ioannidis


Objectives

To explore the effectiveness of data sharing by randomized controlled trials (RCTs) in journals with a full data sharing policy and to describe potential difficulties encountered in the process of performing reanalyses of the primary outcomes.


Design

Survey of published RCTs.



Eligibility criteria

RCTs that had been submitted and published by The BMJ and PLOS Medicine subsequent to the adoption of data sharing policies by these journals.

Main outcome measure

The primary outcome was data availability, defined as the eventual receipt of complete data with clear labelling. Primary outcomes were reanalyzed to assess to what extent studies were reproduced. Difficulties encountered were described.


Results

37 RCTs (21 from The BMJ and 16 from PLOS Medicine) published between 2013 and 2016 met the eligibility criteria. 17/37 (46%, 95% confidence interval 30% to 62%) satisfied the definition of data availability and 14 of the 17 (82%, 59% to 94%) were fully reproduced on all their primary outcomes. Of the remaining RCTs, errors were identified in two, but the reanalyses reached similar conclusions, and one paper did not provide enough information in the Methods section to reproduce the analyses. Difficulties identified included problems in contacting corresponding authors and a lack of resources on their part for preparing the datasets. In addition, data sharing practices varied widely across study groups.
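The proportions and intervals reported above can be approximated with a short calculation. The sketch below is illustrative only: the abstract does not say which interval method the authors used, so the Wilson score interval is an assumption, and the function name `wilson_interval` is hypothetical. The bounds it prints may therefore differ slightly from the published figures.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Return (lower, upper) bounds of the Wilson score interval.

    Assumption: z = 1.96 approximates a two-sided 95% interval. The
    abstract does not state the method actually used by the authors.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    )
    return center - half_width, center + half_width


# Counts taken from the abstract: 17 of 37 RCTs met the data availability
# definition, and 14 of those 17 were fully reproduced.
for label, k, n in [("data available", 17, 37), ("fully reproduced", 14, 17)]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.0%}, approx. 95% CI {lo:.0%} to {hi:.0%}")
```

With so few trials the intervals are wide, which is why the abstract reports them alongside the point estimates.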


Conclusions

Data availability was not optimal in two journals with a strong policy for data sharing. When investigators shared data, most reanalyses largely reproduced the original results. Data sharing practices need to become more widespread and streamlined to allow meaningful reanalyses and reuse of data.


Recommendations to Improve Downloads of Large Earth Observation Data

Authors : Rahul Ramachandran, Christopher Lynnes, Kathleen Baynes, Kevin Murphy, Jamie Baker, Jamie Kinney, Ariel Gold, Jed Sundwall, Mark Korver, Allison Lieber, William Vambenepe, Matthew Hancher, Rebecca Moore, Tyler Erickson, Josh Henretig, Brant Zwiefel, Heather Patrick-Ahlstrom, Matthew J. Smith

With the volume of Earth observation data expanding rapidly, cloud computing is quickly changing the way these data are processed, analyzed, and visualized. Collocating freely available Earth observation data on a cloud computing infrastructure may create opportunities unforeseen by the original data provider for innovation and value-added data re-use, but existing systems at data centers are not designed for supporting requests for large data transfers.

Because there is no common methodology, each data center handles such requests from different cloud vendors differently. Guidelines are needed so that all cloud vendors can use a common methodology for bulk-downloading data from data centers, sparing the providers from building custom capabilities to meet the needs of individual vendors.

This paper presents recommendations distilled from use cases provided by three cloud vendors (Amazon, Google, and Microsoft) and based on the vendors’ interactions with data systems at different Federal agencies and organizations.

These specific recommendations range from obvious steps for improving data usability (such as ensuring the use of standard data formats and commonly supported projections) to non-obvious undertakings important for enabling bulk data downloads at scale.

These recommendations can be used to evaluate and improve existing data systems for high-volume data transfers, and their adoption can lead to cloud vendors utilizing a common methodology.




Data Sharing: Convert Challenges into Opportunities

Author : Ana Sofia Figueiredo

Initiatives for sharing research data are opportunities to increase the pace of knowledge discovery and scientific progress. The reuse of research data has the potential to avoid the duplication of data sets and to bring new views from multiple analyses of the same data set.

For example, the study of genomic variations associated with cancer profits from the universal collection of such data and helps in selecting the most appropriate therapy for a specific patient. However, data sharing poses challenges to the scientific community.

These challenges are of ethical, cultural, legal, financial, or technical nature. This article reviews the impact that data sharing has in science and society and presents guidelines to improve the efficient sharing of research data.



Amplifying Data Curation Efforts to Improve the Quality of Life Science Data

Authors : Mariam Alqasab, Suzanne M. Embury, Sandra de F. Mendes Sampaio

In the era of data science, datasets are shared widely and used for many purposes unforeseen by the original creators of the data. In this context, defects in datasets can have far-reaching consequences, spreading from dataset to dataset, and affecting the consumers of data in ways that are hard to predict or quantify.

Some form of waste is often the result. For example, scientists using defective data to propose hypotheses for experimentation may waste their limited wet lab resources chasing the wrong experimental targets. Scarce drug trial resources may be used to test drugs that actually have little chance of giving a cure.

Because of the potential real-world costs, database owners care about providing high-quality data. Automated curation tools can be used to an extent to discover and correct some forms of defect.

However, in some areas human curation, performed by highly trained domain experts, is needed to ensure that the data represents our current interpretation of reality accurately.

Human curators are expensive, and there is far more curation work to be done than there are curators available to perform it. Tools and techniques are needed to enable the full value to be obtained from the curation effort currently available.

In this paper, we explore one possible approach to maximising the value obtained from human curators, by automatically extracting information about data defects and corrections from the work that the curators do.

This information is packaged in a source-independent form, to allow it to be used by the owners of other databases (for which human curation effort is not available or is insufficient).

This amplifies the efforts of the human curators, allowing their work to be applied to other sources, without requiring any additional effort or change in their processes or tool sets. We show that this approach can discover significant numbers of defects, which can also be found in other sources.



A review of data sharing statements in observational studies published in the BMJ: A cross-sectional study

Authors : Laura McDonald, Anna Schultze, Alex Simpson, Sophie Graham, Radek Wasiak, Sreeram V. Ramagopalan

In order to understand the current state of data sharing in observational research studies, we reviewed data sharing statements of observational studies published in a general medical journal, the British Medical Journal.

We found that the majority (63%) of observational studies published between 2015 and 2017 included a statement that implied that data used in the study could not be shared. If the findings of our exploratory study are confirmed, room for improvement in the sharing of real-world or observational research data exists.
