Institutional Data Repository Development, a Moving Target

Authors : Colleen Fallaw, Genevieve Schmitt, Hoa Luong, Jason Colwell, Jason Strutz

At the end of 2019, the Research Data Service (RDS) at the University of Illinois at Urbana-Champaign (UIUC) completed its fifth year as a campus-wide service. In order to gauge the effectiveness of the RDS in meeting the needs of Illinois researchers, RDS staff developed a five-year review consisting of a survey and a series of in-depth focus group interviews.

As a result of that review, Illinois Data Bank, our institutional data repository developed in-house by University Library IT staff, was recognized as our unit's most useful service offering. When launched in 2016, storage resources and web servers for Illinois Data Bank and supporting systems were hosted on-premises at UIUC.

As anticipated, researchers increasingly need to share large and complex datasets. To leverage storage that is potentially more reliable, highly available, cost-effective, and scalable, and that is accessible to computation resources, we migrated our item bitstreams and web services to the cloud. Our efforts have met with success, but also with painful bumps along the way.

This article describes how we supported data curation workflows while transitioning from on-premises to cloud resource hosting. It details our approaches to ingesting, curating, and offering access to dataset files up to 2 TB in size, which may be archive-type files (e.g., .zip or .tar) containing complex directory structures.
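The abstract does not describe the authors' implementation; as an illustrative sketch only, one common way to curate large archive-type deposits is to enumerate the directory structure inside a .zip or .tar file without extracting it. The helper below (the function name and example paths are hypothetical) uses Python's standard `zipfile` and `tarfile` modules:

```python
import io
import tarfile
import zipfile

def list_archive_entries(path_or_buf, kind):
    """Return the member paths of a .zip or .tar archive without extracting it."""
    if kind == "zip":
        with zipfile.ZipFile(path_or_buf) as zf:
            return zf.namelist()
    if kind == "tar":
        with tarfile.open(fileobj=path_or_buf) as tf:
            return tf.getnames()
    raise ValueError(f"unsupported archive kind: {kind}")

# Build a tiny in-memory .zip with a nested directory structure for demonstration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("dataset/readme.txt", "example")
    zf.writestr("dataset/raw/part-001.csv", "a,b\n1,2\n")
buf.seek(0)

entries = list_archive_entries(buf, "zip")
# entries now holds ["dataset/readme.txt", "dataset/raw/part-001.csv"]
```

Listing members rather than unpacking them keeps curation review of multi-terabyte deposits cheap, since only the archive's central directory or header blocks are read.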


Evidence for Trusted Digital Repository Reviews: An Analysis of Perspectives

Author : Jonathan David Crabtree

Building trust in our research infrastructure is important for the future of the academy. Trust in research data repositories is critical as they provide the evidence for past discoveries as well as the input for future discoveries.

Archives and repositories are examining their options for trustworthy review, audit, and certification as a means to build trust within their content creator and user communities. One option these institutions have to increase and demonstrate their trustworthiness is to apply for the CoreTrustSeal.

Applicants for the CoreTrustSeal are becoming more numerous and diverse, ranging from general-purpose repositories to preservation infrastructure providers and domain repositories. This demand for certification, and the subjective nature of decisions around levels of CoreTrustSeal compliance, drives this dissertation.

It is a study of the review process and its veracity and consistency in determining the trustworthiness of applicant repositories. Several assumptions underlie this work. First, audits and reviews must be based on evidence supplied by the repository under scrutiny; second, not all reviewers will approach a piece of evidence in the same fashion or give it the same weight; third, the value and veracity of required evidence may be subject to reviewers' diverse perspectives and diverse repository community norms.

This research used a thematic qualitative analysis approach to identify similarities and differences in CoreTrustSeal reviewers' responses during semi-structured interviews, in order to better understand potential subjective differences among respondents. The non-probabilistic sample of participants represented a balance of perspectives across three anticipated categories: administrator, archivist, and technologist.

Themes converged around several key concepts. Nearly all participants felt they were performing a peer review process and working to help the repository community and the research enterprise.

Reviewers were questioned about the various CoreTrustSeal application requirements and which ones they felt were the most important. No clear evidence emerged to indicate that variations in perspectives affected the subjective review of application evidence. The same categories of evidence were often selected and identified as being critical across all three categories (i.e., administrator, archivist, and technologist).

Many valuable suggestions from participants were recorded and can be implemented to ensure the consistency and sustainability of this trusted repository review process.

These suggestions and concepts were also very evenly distributed across the three perspectives. This balance is potentially due to participants' extensive experience across various positions and responsibilities within the organizations they represented.


Openness in Big Data and Data Repositories. The Application of an Ethics Framework for Big Data in Health and Research

Authors : Vicki Xafis, Markus K. Labude

There is a growing expectation, or even requirement, for researchers to deposit a variety of research data in data repositories as a condition of funding or publication. This expectation recognizes the enormous benefits of data collected and created for research purposes being made available for secondary uses, as open science gains increasing support.

This is particularly so in the context of big data, especially where health data is involved. There are, however, also challenges relating to the collection, storage, and re-use of research data.

This paper gives a brief overview of the landscape of data sharing via data repositories and discusses some of the key ethical issues raised by the sharing of health-related research data, including expectations of privacy and confidentiality, the transparency of repository governance structures, access restrictions, as well as data ownership and the fair attribution of credit.

To consider these issues and the values that are pertinent, the paper applies the deliberative balancing approach articulated in the Ethics Framework for Big Data in Health and Research (Xafis et al. 2019) to the domain of Openness in Big Data and Data Repositories.

Please refer to that article for more information on how this framework is to be used, including a full explanation of the key values involved and the balancing approach used in the case study at the end.

URL : Openness in Big Data and Data Repositories. The Application of an Ethics Framework for Big Data in Health and Research


Ouverture des données de recherche dans le domaine académique suisse : outils pour le choix d’une stratégie institutionnelle en matière de dépôt de données

Author : Marielle Guirlet

The current Open Science context brings requirements to open up research data. Data repositories are a crucial instrument for sharing these data publicly.

However, the current abundant and very diverse offering makes selecting a repository difficult for researchers. To help them, their home institutions issue recommendations for choosing the best repository. Some also offer their own data repository, or are considering creating one.

This study, based on a Master's thesis in information science, examines the approach Swiss academic institutions can follow to define their strategy for supporting researchers with respect to data repositories.

It also identifies the information that will help these institutions choose between directing researchers to an existing repository (and which one) and creating a new repository, as well as the specifications such a repository must meet.

After defining the concepts of research data and open repositories, the functionalities, tools, and services a repository needs in order to implement public data sharing are discussed.

Drawing on the criteria used by CoreTrustSeal certification to assess the quality of a repository, and taking these functionalities, tools, and services into account, a model for describing an open research data repository is developed. This model can be used to evaluate an existing repository or to design a new one.

The research data repository strategies of nine Swiss academic institutions, covering both the repositories they use and the repositories they recommend, are analyzed. Recommendations are made based on the good practices observed.

Tools developed for choosing the best open research data repository strategy are then presented. A vade mecum, in the form of a list of questions, helps collect the relevant information.

A decision guide accompanies the institution in its deliberations and enables it to choose its strategy in an informed way, using the information collected previously. Once the strategy is chosen, additional information and recommendations are available for putting it into practice.

A prototype web-browser version of these tools is also presented. It can be adapted as the context evolves and transposed to other countries.


Repository Approaches to Improving the Quality of Shared Data and Code

Authors : Ana Trisovic, Katherine Mika, Ceilyn Boyd, Sebastian Feger, Mercè Crosas

Sharing data and code for reuse has become increasingly important in scientific work over the past decade. However, in practice, shared data and code may be unusable, or published results obtained from them may be irreproducible.

Data repository features and services contribute significantly to the quality, longevity, and reusability of datasets.

This paper presents a combination of original and secondary data analysis studies focusing on computational reproducibility, data curation, and gamified design elements that can be employed to indicate and improve the quality of shared data and code.

The findings of these studies are sorted into three approaches that can be valuable to data repositories, archives, and other research dissemination platforms.

URL : Repository Approaches to Improving the Quality of Shared Data and Code


Improving Opportunities for New Value of Open Data: Assessing and Certifying Research Data Repositories

Author : Robert R. Downs

Investments in research that produce scientific and scholarly data can be leveraged by enabling the resulting research data products and services to be used by broader communities and for new purposes, extending reuse beyond the initial users and purposes for which the data were originally collected.

Submitting research data to a data repository offers opportunities for the data to be used in the future, providing ways for new benefits to be realized from data reuse. Improvements to data repositories that facilitate new uses of data increase the potential for data reuse and for gains in the value of open data products and services that are associated with such reuse.

Assessing and certifying the capabilities and services offered by data repositories provides opportunities for improving the repositories and for realizing the value to be attained from new uses of data.

The evolution of data repository certification instruments is described and discussed in terms of the implications for the curation and continuing use of research data.

URL : Improving Opportunities for New Value of Open Data: Assessing and Certifying Research Data Repositories


Entrepôts de données de recherche : mesurer l’impact de l’Open Science à l’aune de la consultation des jeux de données déposés

Author : Violaine Rebouillat

The 2000s and 2010s saw the development of a growing number of research e-infrastructures, making it easier to share and access scientific data. This trend was reinforced by the rise of open data policies, which led to a proliferation of data reservoirs, also called "data repositories". Quantifying and qualifying the use of publicly released data is essential for assessing the impact of open data policies.

In this article, we examine the use of data deposited in repositories. To what extent are these data viewed and downloaded?

The article presents the first results of a quantitative survey of 20 repositories. It outlines two trends, which at this stage remain specific to the sample studied, namely: (1) an overall increase in the number of views, downloads, and available data in the repositories over the period studied (2015-2020), and (2) the concentration of downloads on a relatively small proportion of each repository's data (on the order of 10% to 30%).
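The survey's exact method is not given in the abstract; as an illustrative sketch with made-up download counts, the kind of concentration it reports can be expressed as the smallest fraction of datasets that together account for a given share of all downloads:

```python
def download_concentration(downloads, share=0.8):
    """Smallest fraction of datasets that together account for `share`
    of all downloads (downloads: one count per dataset)."""
    total = sum(downloads)
    covered = 0
    for i, d in enumerate(sorted(downloads, reverse=True), start=1):
        covered += d
        if covered >= share * total:
            return i / len(downloads)
    return 1.0

# Hypothetical counts for 10 datasets: most downloads go to a few items.
counts = [500, 300, 80, 40, 30, 20, 15, 10, 3, 2]
frac = download_concentration(counts)  # → 0.2 (2 of 10 datasets cover 80%)
```

In this fabricated example, 20% of the datasets account for 80% of the downloads, the same order of magnitude as the 10%-30% concentration the survey observed.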