Institutional Data Repository Development, a Moving Target

Authors : Colleen Fallaw, Genevieve Schmitt, Hoa Luong, Jason Colwell, Jason Strutz

At the end of 2019, the Research Data Service (RDS) at the University of Illinois at Urbana-Champaign (UIUC) completed its fifth year as a campus-wide service. In order to gauge the effectiveness of the RDS in meeting the needs of Illinois researchers, RDS staff developed a five-year review consisting of a survey and a series of in-depth focus group interviews.

As a result, the Illinois Data Bank, our institutional data repository developed in-house by University Library IT staff, was recognized as the most useful service our unit offers. When the Illinois Data Bank launched in 2016, its storage resources and web servers, along with supporting systems, were hosted on-premises at UIUC.

As anticipated, researchers increasingly need to share large and complex datasets. To take advantage of storage that is potentially more reliable, highly available, cost-effective, and scalable, and that is accessible to computation resources, we migrated our item bitstreams and web services to the cloud. Our efforts have met with success, but also with painful bumps along the way.

This article describes how we supported data curation workflows while transitioning from on-premises to cloud resource hosting. It details our approaches to ingesting, curating, and offering access to dataset files up to 2 TB in size, which may be archive-type files (e.g., .zip or .tar) containing complex directory structures.
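The article itself focuses on infrastructure and workflow decisions rather than code, but as a rough illustration of one curation step it mentions, the following minimal Python sketch (not taken from the article; the file name and helper function are hypothetical) lists the contents of a .zip or .tar deposit without extracting it, so a curator can review the directory structure inside a large archive:

import tarfile
import zipfile


def archive_manifest(path):
    """Return (member_path, size_in_bytes) pairs for a .zip or .tar archive.

    Listing members avoids extracting a potentially very large deposit to
    disk just to see what is inside it (for .tar this still requires a
    sequential scan of the archive headers).
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return [(i.filename, i.file_size) for i in zf.infolist() if not i.is_dir()]
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            return [(m.name, m.size) for m in tf.getmembers() if m.isfile()]
    raise ValueError(f"{path} is not a recognized .zip or .tar archive")


if __name__ == "__main__":
    # "example_dataset.zip" is a placeholder, not a file from the article.
    for name, size in archive_manifest("example_dataset.zip"):
        print(f"{size:>12}  {name}")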

URL : https://journal.code4lib.org/articles/15821

Repository Approaches to Improving the Quality of Shared Data and Code

Authors : Ana Trisovic, Katherine Mika, Ceilyn Boyd, Sebastian Feger, Mercè Crosas

Sharing data and code for reuse has become increasingly important in scientific work over the past decade. However, in practice, shared data and code may be unusable, or published results obtained from them may be irreproducible.

Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets.

This paper presents a combination of original and secondary data analysis studies focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code.

The findings of these studies are sorted into three approaches that can be valuable to data repositories, archives, and other research dissemination platforms.

URL : Repository Approaches to Improving the Quality of Shared Data and Code

DOI : https://doi.org/10.3390/data6020015

Alter-Value in Data Reuse: Non-Designated Communities and Creative Processes

Author : Guillaume Boutard

This paper builds on the investigation of data reuse in creative processes to discuss ‘epistemic pluralism’ and data ‘alter-value’ in research data management. Focussing on a specific non-designated community, we conducted semi-structured interviews with five artists in relation to five works.

Data reuse is a critical component of all these works. The qualitative content analysis brings to light agonistic-antagonistic practices in data reuse and shows multiple deconstructions of the notion of data value as it is portrayed in the data reuse literature.

Finally, the paper brings to light the benefits of including such practices in the conceptualization of data curation.

URL : Alter-Value in Data Reuse: Non-Designated Communities and Creative Processes

DOI : https://doi.org/10.5334/dsj-2020-023

Practices, Challenges, and Prospects of Big Data Curation: a Case Study in Geoscience

Authors : Suzhen Chen, Bin Chen

Open and persistent access to past, present, and future scientific data is fundamental for transparent and reproducible data-driven research. The scientific community now faces both challenges and opportunities arising from increasingly complex disciplinary data systems.

Concerted efforts from domain experts, information professionals, and Internet technology experts are essential to ensure the accessibility and interoperability of big data.

Here we review current practices in building and managing big data within the context of large data infrastructure, using geoscience cyberinfrastructure such as Interdisciplinary Earth Data Alliance (IEDA) and EarthCube as a case study.

Geoscience is a data-rich discipline with a rapid expansion of sophisticated and diverse digital data sets. Having started to embrace the digital age, the community has applied big data and data-mining tools to new types of research.

We also identify current challenges, key elements, and prospects for constructing a more robust and future-proof big data infrastructure for research and publication, as well as the roles, qualifications, and opportunities for librarians and information professionals in the data era.

URL : Practices, Challenges, and Prospects of Big Data Curation: a Case Study in Geoscience

DOI : https://doi.org/10.2218/ijdc.v14i1.669

Data Curation for Big Interdisciplinary Science: The Pulley Ridge Experience

Authors : Timothy B. Norris, Christopher C. Mader

The curation and preservation of scientific data have long been recognized as essential activities for the reproducibility of science and the advancement of knowledge. While investment into data curation for specific disciplines and at individual research institutions has advanced the ability to preserve research data products, data curation for big interdisciplinary science remains relatively unexplored terrain.

To fill this lacuna, this article presents a case study of data curation for the National Centers for Coastal Ocean Science (NCCOS)-funded project “Understanding Coral Ecosystem Connectivity in the Gulf of Mexico-Pulley Ridge to the Florida Keys,” undertaken from 2011 to 2018 by more than 30 researchers at several research institutions.

The data curation process is described and a discussion of strengths, weaknesses and lessons learned is presented. Major conclusions from this case study include: the reimplementation of data repository infrastructure builds valuable institutional data curation knowledge but may not meet data curation standards and best practices; data from big interdisciplinary science can be considered as a special collection with the implication that metadata takes the form of a finding aid or catalog of datasets within the larger project context; and there are opportunities for data curators and librarians to synthesize and integrate results across disciplines and to create exhibits as stories that emerge from interdisciplinary big science.

URL : Data Curation for Big Interdisciplinary Science: The Pulley Ridge Experience

Alternative location : https://escholarship.umassmed.edu/jeslib/vol8/iss2/8/

Peer Review of Research Data Submissions to ScholarsArchive@OSU: How can we improve the curation of research datasets to enhance reusability?

Authors : Clara Llebot, Steven Van Tuyl

Objective

Best practices such as the FAIR Principles (Findability, Accessibility, Interoperability, Reusability) were developed to ensure that published datasets are reusable. While we employ best practices in the curation of datasets, we want to learn how domain experts view the reusability of datasets in our institutional repository, ScholarsArchive@OSU.

Curation workflows are designed by data curators based on their own recommendations, but research data is extremely specialized, and such workflows are rarely evaluated by researchers.

In this project we used peer review by domain experts to evaluate the reusability of the datasets in our institutional repository, with the goal of informing our curation methods and ensuring that the limited resources of our library maximize the reusability of research data.

Methods

We asked all researchers who have submitted datasets to Oregon State University’s repository to refer us to domain experts who could review the reusability of their datasets. Two data curators who are non-experts also reviewed the same datasets.

We gave both groups review guidelines based on those of several journals. Eleven domain experts and two data curators reviewed eight datasets.

The review included the quality of the repository record, the quality of the documentation, and the quality of the data. We then compared the comments given by the two groups.

Results

Domain experts and non-expert data curators largely converged on similar scores for the reviewed datasets, but the focus of the domain experts’ critiques diverged somewhat from that of the curators.

A few broad issues common across reviews were: insufficient documentation, the use of links to journal articles in the place of documentation, and concerns about duplication of effort in creating documentation and metadata. Reviews also reflected the background and skills of the reviewer.

Domain experts expressed a lack of expertise in data curation practices and data curators expressed their lack of expertise in the research domain.

Conclusions

The results of this investigation could help guide future research data curation activities and align domain expert and data curator expectations for reusability of datasets.

We recommend further exploration of these common issues and additional domain-expert peer-review projects to further refine and align expectations for research data reusability.

URL : Peer Review of Research Data Submissions to ScholarsArchive@OSU: How can we improve the curation of research datasets to enhance reusability?

DOI : https://doi.org/10.7191/jeslib.2019.1166

Research data management in the French National Research Center (CNRS)

Authors : Joachim Schöpfel, Coline Ferrant, Francis Andre, Renaud Fabre

Purpose

The purpose of this paper is to present empirical evidence on the opinion and behaviour of French scientists (senior management level) regarding research data management (RDM).

Design/methodology/approach

The results are part of a nationwide survey on scientific information and documentation with 432 directors of French public research laboratories conducted by the French Research Center CNRS in 2014.

Findings

The paper presents empirical results about data production (types), management (human resources, IT, funding, and standards), data sharing and related needs, and highlights significant disciplinary differences.

Also, it appears that RDM and data sharing are not directly correlated with commitment to open access. Regarding the FAIR data principles, the paper reveals that 68 per cent of all laboratory directors affirm that their data production and management is compliant with at least one of the FAIR principles.

But only 26 per cent are compliant with at least three principles, and less than 7 per cent are compliant with all four FAIR criteria, with laboratories in nuclear physics, SSH, and earth sciences and astronomy ahead of other disciplines, especially concerning the findability and availability of their data output.

The paper concludes with comments about research data service development and recommendations for an institutional RDM policy.

Originality/value

For the first time, a nationwide survey was conducted with the senior research management level from all scientific disciplines. Surveys on RDM usually assess individual data behaviours, skills and needs. This survey is different insofar as it addresses institutional and collective data practice.

The respondents did not report on their own data behaviours and attitudes but were asked to provide information about their laboratory. The response rate was high (>30 per cent), and the results provide good insight into the real support and uptake of RDM by senior research managers, who provide both models (examples of good practice) and opinion leadership.

URL : https://hal.univ-lille3.fr/hal-01728541/