Practices, Challenges, and Prospects of Big Data Curation: a Case Study in Geoscience

Authors : Suzhen Chen, Bin Chen

Open and persistent access to past, present, and future scientific data is fundamental for transparent and reproducible data-driven research. The scientific community now faces both challenges and opportunities arising from increasingly complex disciplinary data systems.

Concerted efforts from domain experts, information professionals, and Internet technology experts are essential to ensure the accessibility and interoperability of big data.

Here we review current practices in building and managing big data within the context of large data infrastructure, using geoscience cyberinfrastructure such as the Interdisciplinary Earth Data Alliance (IEDA) and EarthCube as a case study.

Geoscience is a data-rich discipline with a rapid expansion of sophisticated and diverse digital data sets. Having started to embrace the digital age, the community has applied big data and data mining tools to this new type of research.

We also identify current challenges, key elements, and prospects for constructing a more robust and future-proof big data infrastructure for research and publication, as well as the roles, qualifications, and opportunities for librarians and information professionals in the data era.

DOI : https://doi.org/10.2218/ijdc.v14i1.669

Data Curation for Big Interdisciplinary Science: The Pulley Ridge Experience

Authors : Timothy B. Norris, Christopher C. Mader

The curation and preservation of scientific data has long been recognized as an essential activity for the reproducibility of science and the advancement of knowledge. While investment into data curation for specific disciplines and at individual research institutions has advanced the ability to preserve research data products, data curation for big interdisciplinary science remains relatively unexplored terrain.

To fill this lacuna, this article presents a case study of the data curation for the National Centers for Coastal Ocean Science (NCCOS) funded project “Understanding Coral Ecosystem Connectivity in the Gulf of Mexico-Pulley Ridge to the Florida Keys” undertaken from 2011 to 2018 by more than 30 researchers at several research institutions.

The data curation process is described and a discussion of strengths, weaknesses and lessons learned is presented. Major conclusions from this case study include: the reimplementation of data repository infrastructure builds valuable institutional data curation knowledge but may not meet data curation standards and best practices; data from big interdisciplinary science can be considered as a special collection with the implication that metadata takes the form of a finding aid or catalog of datasets within the larger project context; and there are opportunities for data curators and librarians to synthesize and integrate results across disciplines and to create exhibits as stories that emerge from interdisciplinary big science.

Alternative location : https://escholarship.umassmed.edu/jeslib/vol8/iss2/8/

Peer Review of Research Data Submissions to ScholarsArchive@OSU: How can we improve the curation of research datasets to enhance reusability?

Authors : Clara Llebot, Steven Van Tuyl

Objective

Best practices such as the FAIR Principles (Findability, Accessibility, Interoperability, Reusability) were developed to ensure that published datasets are reusable. While we employ best practices in the curation of datasets, we want to learn how domain experts view the reusability of datasets in our institutional repository, ScholarsArchive@OSU.

Curation workflows are designed by data curators based on their own recommendations, but research data is extremely specialized, and such workflows are rarely evaluated by researchers.

In this project we used peer review by domain experts to evaluate the reusability of the datasets in our institutional repository, with the goal of informing our curation methods and ensuring that the limited resources of our library are maximizing the reusability of research data.

Methods

We asked all researchers who have submitted datasets to Oregon State University’s repository to refer us to domain experts who could review the reusability of their datasets. Two data curators who are non-experts also reviewed the same datasets.

We gave both groups review guidelines based on the guidelines of several journals. Eleven domain experts and two data curators reviewed eight datasets.

The review included the quality of the repository record, the quality of the documentation, and the quality of the data. We then compared the comments given by the two groups.

Results

Domain experts and non-expert data curators largely converged on similar scores for reviewed datasets, but the focus of critique by domain experts was somewhat divergent.

A few broad issues common across reviews were: insufficient documentation, the use of links to journal articles in the place of documentation, and concerns about duplication of effort in creating documentation and metadata. Reviews also reflected the background and skills of the reviewer.

Domain experts expressed a lack of expertise in data curation practices and data curators expressed their lack of expertise in the research domain.

Conclusions

The results of this investigation could help guide future research data curation activities and align domain expert and data curator expectations for reusability of datasets.

We recommend further exploration of these common issues and additional domain expert peer-review projects to further refine and align expectations for research data reusability.

DOI : https://doi.org/10.7191/jeslib.2019.1166

Research data management in the French National Research Center (CNRS)

Authors : Joachim Schöpfel, Coline Ferrant, Francis Andre, Renaud Fabre

Purpose

The purpose of this paper is to present empirical evidence on the opinion and behaviour of French scientists (senior management level) regarding research data management (RDM).

Design/methodology/approach

The results are part of a nationwide survey on scientific information and documentation with 432 directors of French public research laboratories conducted by the French Research Center CNRS in 2014.

Findings

The paper presents empirical results about data production (types), management (human resources, IT, funding, and standards), data sharing and related needs, and highlights significant disciplinary differences.

Also, it appears that RDM and data sharing are not directly correlated with the commitment to open access. Regarding the FAIR data principles, the paper reveals that 68 per cent of all laboratory directors affirm that their data production and management is compliant with at least one of the FAIR principles.

But only 26 per cent are compliant with at least three principles, and less than 7 per cent are compliant with all four FAIR criteria, with laboratories in nuclear physics, social sciences and humanities (SSH), and earth sciences and astronomy being ahead of other disciplines, especially concerning the findability and the availability of their data output.

The paper concludes with comments about research data service development and recommendations for an institutional RDM policy.

Originality/value

For the first time, a nationwide survey was conducted with the senior research management level from all scientific disciplines. Surveys on RDM usually assess individual data behaviours, skills and needs. This survey is different insofar as it addresses institutional and collective data practice.

The respondents did not report on their own data behaviours and attitudes but were asked to provide information about their laboratory. The response rate was high (>30 per cent), and the results provide good insight into the real support and uptake of RDM by senior research managers who provide both models (examples for good practice) and opinion leadership.

URL : https://hal.univ-lille3.fr/hal-01728541/

Curating Scientific Information in Knowledge Infrastructures

Authors : Markus Stocker, Pauli Paasonen, Markus Fiebig, Martha A. Zaidan, Alex Hardisty

Interpreting observational data is a fundamental task in the sciences, specifically in earth and environmental science where observational data are increasingly acquired, curated, and published systematically by environmental research infrastructures.

Typically subject to substantial processing, observational data are used by research communities, their research groups and individual scientists, who interpret such primary data for their meaning in the context of research investigations.

The result of interpretation is information (meaningful secondary or derived data) about the observed environment. Research infrastructures and research communities are thus essential to evolving uninterpreted observational data into information. In digital form, the classical bearers of information are the commonly known “(elaborated) data products,” for instance maps.

In such form, meaning is generally implicit, e.g., in map colour coding, and thus largely inaccessible to machines. The systematic acquisition, curation, possible publishing and further processing of information gained in observational data interpretation (as machine-readable data and their machine-readable meaning) is not common practice among environmental research infrastructures.

For a use case in aerosol science, we elucidate these problems and present a Jupyter-based prototype infrastructure that exploits a machine learning approach to interpretation and could support a research community in interpreting observational data and, more importantly, in curating and further using resulting information about a studied natural phenomenon.

DOI : http://doi.org/10.5334/dsj-2018-021

Conceptualizing Data Curation Activities Within Two Academic Libraries

Authors : Sophia Lafferty-Hess, Julie Rudder, Moira Downey, Susan Ivey, Jennifer Darragh

A growing focus on sharing research data that meet certain standards, such as the FAIR guiding principles, has resulted in libraries increasingly developing and scaling up support for research data.

As libraries consider what new data curation services they would like to provide as part of their repository programs, there are various questions that arise surrounding scalability, resource allocation, requisite expertise, and how to communicate these services to the research community.

Data curation can involve a variety of tasks and activities. Some of these activities can be managed by systems, some require human intervention, and some require highly specialized domain or data type expertise.

At the 2017 Triangle Research Libraries Network Institute, staff from the University of North Carolina at Chapel Hill and Duke University used the 47 data curation activities identified by the Data Curation Network project to create conceptual groupings of data curation activities.

The results of this “thought-exercise” are discussed in this white paper. The purpose of this exercise was to provide more specificity around data curation within our individual contexts as a method to consistently discuss our current service models, identify gaps we would like to fill, and determine what is currently out of scope.

We hope to foster an open and productive discussion throughout the larger academic library community about how we prioritize data curation activities as we face growing demand and limited resources.

DOI : https://dx.doi.org/10.17605/OSF.IO/ZJ5PQ

How Important is Data Curation? Gaps and Opportunities for Academic Libraries

Authors : Lisa R Johnston, Jacob Carlson, Cynthia Hudson-Vitale, Heidi Imker, Wendy Kozlowski, Robert Olendorf, Claire Stewart

INTRODUCTION

Data curation may be an emerging service for academic libraries, but researchers actively “curate” their data in a number of ways, even if terminology may not always align. Building on past user-needs assessments performed via survey and focus groups, the authors sought direct input from researchers on the importance and utilization of specific data curation activities.

METHODS

Between October 21, 2016, and November 18, 2016, the study team held focus groups with 91 participants at six different academic institutions to determine which data curation activities were most important to researchers, which activities were currently underway for their data, and how satisfied they were with the results.

RESULTS

Researchers are actively engaged in a variety of data curation activities, and while they considered most data curation activities to be highly important, a majority of the sample reported dissatisfaction with the current state of data curation at their institution.

DISCUSSION

Our findings demonstrate specific gaps and opportunities for academic libraries to focus their data curation services to more effectively meet researcher needs.

CONCLUSION

Research libraries stand to benefit their users by emphasizing, investing in, and/or heavily promoting the highly valued services that may not currently be in use by many researchers.

DOI : http://doi.org/10.7710/2162-3309.2198