Recommendations to Improve Downloads of Large Earth Observation Data

Authors : Rahul Ramachandran, Christopher Lynnes, Kathleen Baynes, Kevin Murphy, Jamie Baker, Jamie Kinney, Ariel Gold, Jed Sundwall, Mark Korver, Allison Lieber, William Vambenepe, Matthew Hancher,  Rebecca Moore, Tyler Erickson, Josh Henretig,
Brant Zwiefel, Heather Patrick-Ahlstrom, Matthew J. Smith

With the volume of Earth observation data expanding rapidly, cloud computing is quickly changing the way these data are processed, analyzed, and visualized. Collocating freely available Earth observation data on a cloud computing infrastructure may create opportunities unforeseen by the original data provider for innovation and value-added data re-use, but existing systems at data centers are not designed to support requests for large data transfers.

Because there is no common methodology, each data center must handle such requests from different cloud vendors differently. Guidelines are needed to enable all cloud vendors to use a common methodology for bulk-downloading data from data centers, sparing providers from building custom capabilities to meet the needs of individual vendors.

This paper presents recommendations distilled from use cases provided by three cloud vendors (Amazon, Google, and Microsoft), based on the vendors’ interactions with data systems at different Federal agencies and organizations.

These specific recommendations range from obvious steps for improving data usability (such as ensuring the use of standard data formats and commonly supported projections) to non-obvious undertakings important for enabling bulk data downloads at scale.

These recommendations can be used to evaluate and improve existing data systems for high-volume data transfers, and their adoption can lead to cloud vendors utilizing a common methodology.
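One of the non-obvious capabilities implied by bulk downloads at scale is resumable transfer, since a multi-terabyte pull will inevitably hit failures. As a minimal sketch (not a recommendation from the paper itself), the byte-range arithmetic behind HTTP `Range`-based chunked downloading looks like this; the chunk size and any server URL would be supplied by the data center:

```python
# Hypothetical sketch: split a large file into inclusive byte ranges so a
# bulk download can be parallelized or resumed after a failure. Each pair
# maps to an HTTP header of the form "Range: bytes=start-end".

def chunk_ranges(total_bytes: int, chunk_bytes: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) byte ranges covering the whole file."""
    ranges = []
    start = 0
    while start < total_bytes:
        end = min(start + chunk_bytes, total_bytes) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

print(chunk_ranges(10_000, 4_096))
# [(0, 4095), (4096, 8191), (8192, 9999)]
```

A client that records which ranges have completed can restart a failed transfer without re-downloading finished chunks, which is one way a common bulk-download methodology can stay robust at scale.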

URL : Recommendations to Improve Downloads of Large Earth Observation Data



Data Sharing: Convert Challenges into Opportunities

Author : Ana Sofia Figueiredo

Initiatives for sharing research data are opportunities to increase the pace of knowledge discovery and scientific progress. The reuse of research data has the potential to avoid the duplication of data sets and to bring new insights from multiple analyses of the same data set.

For example, the study of genomic variations associated with cancer benefits from the universal collection of such data, which helps in selecting the most appropriate therapy for a specific patient. However, data sharing poses challenges to the scientific community.

These challenges are of an ethical, cultural, legal, financial, or technical nature. This article reviews the impact that data sharing has on science and society and presents guidelines to improve the efficient sharing of research data.

URL : Data Sharing: Convert Challenges into Opportunities


Amplifying Data Curation Efforts to Improve the Quality of Life Science Data

Authors : Mariam Alqasab, Suzanne M. Embury, Sandra de F. Mendes Sampaio

In the era of data science, datasets are shared widely and used for many purposes unforeseen by the original creators of the data. In this context, defects in datasets can have far reaching consequences, spreading from dataset to dataset, and affecting the consumers of data in ways that are hard to predict or quantify.

Some form of waste is often the result. For example, scientists using defective data to propose hypotheses for experimentation may waste their limited wet lab resources chasing the wrong experimental targets. Scarce drug trial resources may be used to test drugs that actually have little chance of giving a cure.

Because of the potential real-world costs, database owners care about providing high-quality data. Automated curation tools can, to an extent, discover and correct some forms of defect.

However, in some areas human curation, performed by highly trained domain experts, is needed to ensure that the data accurately represents our current interpretation of reality.

Human curators are expensive, and there is far more curation work to be done than there are curators available to perform it. Tools and techniques are needed to enable the full value to be obtained from the curation effort currently available.

In this paper, we explore one possible approach to maximising the value obtained from human curators: automatically extracting information about data defects and corrections from the work that the curators do.

This information is packaged in a source-independent form, to allow it to be used by the owners of other databases (for which human curation effort is not available or is insufficient).

This amplifies the efforts of the human curators, allowing their work to be applied to other sources, without requiring any additional effort or change in their processes or tool sets. We show that this approach can discover significant numbers of defects, which can also be found in other sources.

URL : Amplifying Data Curation Efforts to Improve the Quality of Life Science Data


A review of data sharing statements in observational studies published in the BMJ: A cross-sectional study

Authors : Laura McDonald, Anna Schultze, Alex Simpson, Sophie Graham, Radek Wasiak, Sreeram V. Ramagopalan

In order to understand the current state of data sharing in observational research studies, we reviewed data sharing statements of observational studies published in a general medical journal, the British Medical Journal.

We found that the majority (63%) of observational studies published between 2015 and 2017 included a statement implying that the data used in the study could not be shared. If the findings of our exploratory study are confirmed, there is room for improvement in the sharing of real-world or observational research data.

URL : A review of data sharing statements in observational studies published in the BMJ: A cross-sectional study


Versioned data: why it is needed and how it can be achieved (easily and cheaply)

Authors : Daniel S. Falster, Richard G. FitzJohn, Matthew W. Pennell, William K. Cornwell

The sharing and re-use of data has become a cornerstone of modern science. Multiple platforms now allow quick and easy data sharing. So far, however, data publishing models have not accommodated on-going scientific improvements in data: for many problems, datasets continue to grow with time — more records are added, errors fixed, and new data structures are created. In other words, datasets, like scientific knowledge, advance with time.

We therefore suggest that many datasets would be usefully published as a series of versions, with a simple naming system to allow users to perceive the type of change between versions. In this article, we argue for adopting the paradigm and processes for versioned data, analogous to software versioning.

We also introduce a system called Versioned Data Delivery and present tools for creating, archiving, and distributing versioned data easily, quickly, and cheaply. These new tools allow for individual research groups to shift from a static model of data curation to a dynamic and versioned model that more naturally matches the scientific process.
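The versioning paradigm described above can be illustrated with a minimal sketch borrowing software-style semantic version numbers. The mapping of change types to version components here is an assumption chosen for illustration, not the exact scheme defined in the paper:

```python
# Hypothetical sketch: semantic-style version numbers for a dataset.
# Assumed convention (for illustration only):
#   major - breaking structural change (e.g. columns renamed or removed)
#   minor - new records appended
#   patch - error corrections (e.g. typos or unit fixes)

from typing import NamedTuple

class DataVersion(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, tag: str) -> "DataVersion":
        major, minor, patch = (int(part) for part in tag.split("."))
        return cls(major, minor, patch)

    def bump(self, change: str) -> "DataVersion":
        if change == "structure":
            return DataVersion(self.major + 1, 0, 0)
        if change == "records":
            return DataVersion(self.major, self.minor + 1, 0)
        if change == "fixes":
            return DataVersion(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change}")

v = DataVersion.parse("1.4.2")
print(v.bump("records"))
# DataVersion(major=1, minor=5, patch=0)
```

Under such a scheme, a user comparing two version tags can tell at a glance whether an analysis script written against the old version will still run, which is exactly the signal the naming system is meant to convey.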

URL : Versioned data: why it is needed and how it can be achieved (easily and cheaply)



The Evolution, Approval and Implementation of the U.S. Geological Survey Science Data Lifecycle Model

Authors : John L. Faundeen, Vivian B. Hutchison

This paper details how the U.S. Geological Survey (USGS) Community for Data Integration (CDI) Data Management Working Group developed a Science Data Lifecycle Model, and the role the Model plays in shaping agency-wide policies and data management applications.

Starting with an extensive literature review of existing data lifecycle models, representatives from various backgrounds in USGS attended a two-day meeting where the basic elements for the Science Data Lifecycle Model were determined.

Refinements and reviews spanned two years, leading to finalization of the Model and its documentation in a formal agency publication.

The Model serves as a critical framework for data management policy, instructional resources, and tools. The Model helps the USGS address both the Office of Science and Technology Policy (OSTP) memorandum on increased public access to federally funded research and the Office of Management and Budget (OMB) 2013 Open Data directives, serving as the foundation for a series of agency policies related to data management planning, metadata development, data release procedures, and the long-term preservation of data.

Additionally, the agency website devoted to data management instruction and best practices is designed around the Model’s structure and concepts. This paper also illustrates how the Model is being used to develop tools for supporting USGS research and data management processes.



Building a Disciplinary, World‐Wide Data Infrastructure

Authors: Françoise Genova, Christophe Arviset, Bridget M. Almas, Laura Bartolo, Daan Broeder, Emily Law, Brian McMahon

Sharing scientific data with the objective of making it discoverable, accessible, reusable, and interoperable requires work and presents challenges that must be faced at the disciplinary level, in particular to define how the data should be formatted and described.

This paper represents the Proceedings of a session held at SciDataCon 2016 (Denver, 12–13 September 2016). It explores the way a range of disciplines, namely materials science, crystallography, astronomy, earth sciences, humanities and linguistics, get organized at the international level to address those challenges.

The disciplinary culture with respect to data sharing, science drivers, organization, lessons learnt and the elements of the data infrastructure which are or could be shared with others are briefly described. Commonalities and differences are assessed.

Common key elements for success are identified: data sharing should be science driven; defining the disciplinary part of the interdisciplinary standards is mandatory but challenging; sharing of applications should accompany data sharing. Incentives such as journal and funding agency requirements are also similar.

For all, social aspects are more challenging than technological ones. Governance is more diverse, often specific to the discipline organization. Being problem‐driven is also a key factor of success for building bridges to enable interdisciplinary research.

Several international data organizations, such as CODATA, RDA and WDS, can facilitate the establishment of disciplinary interoperability frameworks. As a spin‐off of the session, an RDA Disciplinary Interoperability Interest Group is proposed to bring together representatives across disciplines to better organize and drive the discussion for prioritizing, harmonizing and efficiently articulating disciplinary needs.

URL : Building a Disciplinary, World‐Wide Data Infrastructure