Are Research Datasets FAIR in the Long Run?

Authors : Dennis Wehrle, Klaus Rechert

Currently, initiatives in Germany are developing infrastructure to accept and preserve dissertation data together with the dissertation texts (on state level – bwDATA Diss, on federal level – eDissPlus).

In contrast to specialized data repositories, these services will accept data from all kinds of research disciplines. To uphold the FAIR data principles (Wilkinson et al., 2016), preservation plans are required, because ensuring accessibility, interoperability and re-usability even for a minimum ten-year data retention period can become a major challenge.

Both for longevity and re-usability, file formats matter. In order to ensure access to data, the data’s encoding, i.e. their technical and structural representation in the form of file formats, needs to be understood. Hence, due to a fast technical lifecycle, interoperability, re-use and in some cases even accessibility depend on the data’s format and our future ability to parse or render it.

This leads to several practical questions regarding quality assurance, potential access options and necessary future preservation steps. In this paper, we analyze datasets from public repositories and apply a file format based long-term preservation risk model to support workflows and services for non-domain specific data repositories.
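The idea of a file format based risk model can be illustrated with a minimal sketch. It assumes a simple lookup table that maps file extensions to a risk score and averages the scores over a dataset. The table `FORMAT_RISK`, its values, the `UNKNOWN_RISK` default and the function `dataset_risk` are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical risk scores (0 = low risk, 1 = high risk). Illustrative only,
# not the values or the model used by the authors.
FORMAT_RISK = {
    ".csv": 0.1,
    ".txt": 0.1,
    ".pdf": 0.3,
    ".xlsx": 0.5,
    ".sav": 0.8,
}
UNKNOWN_RISK = 1.0  # assumption: unidentified formats count as highest risk


def dataset_risk(file_names):
    """Return (mean risk score, per-extension file counts) for a dataset."""
    extensions = [PurePosixPath(name).suffix.lower() for name in file_names]
    scores = [FORMAT_RISK.get(ext, UNKNOWN_RISK) for ext in extensions]
    mean_risk = sum(scores) / len(scores) if scores else 0.0
    return mean_risk, Counter(extensions)


# Made-up file names for demonstration.
risk, counts = dataset_risk(["results.csv", "survey.sav", "notes.txt", "model.bin"])
print(f"mean risk: {risk:.2f}", dict(counts))
```

A real service would likely identify formats from file signatures rather than extensions. The extension lookup is used here only to keep the sketch short.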

URL : Are Research Datasets FAIR in the Long Run?


Blockchain and OECD data repositories: opportunities and policymaking implications

Authors : Miguel-Angel Sicilia, Anna Visvizi

Purpose

The purpose of this paper is to employ the case of Organization for Economic Cooperation and Development (OECD) data repositories to examine the potential of blockchain technology in the context of addressing basic contemporary societal concerns, such as transparency, accountability and trust in the policymaking process. Current approaches to sharing data employ standardized metadata, in which the provider of the service is assumed to be a trusted party.

However, derived data, analytic processes or links from policies are in many cases not shared in the same form, thus breaking the provenance trace and making it difficult to repeat analyses conducted in the past. Similarly, it becomes tricky to test whether certain conditions justifying implemented policies still apply.

A higher level of reuse would require a decentralized approach to sharing both data and analytic scripts and software. This could be supported by a combination of blockchain and decentralized file system technology.

Design/methodology/approach

The findings presented in this paper have been derived from an analysis of a case study, i.e., analytics using data made available by the OECD. The set of data the OECD provides is vast and is used broadly.

The argument is structured as follows. First, current issues and topics shaping the debate on blockchain are outlined. Then, the main artifacts on which simple or complex analytic results are based are redefined for some concrete purposes.

The requirements on provenance, trust and repeatability are discussed with regards to the architecture proposed, and a proof of concept using smart contracts is used for reasoning on relevant scenarios.

Findings

A combination of decentralized file systems and an open blockchain such as Ethereum supporting smart contracts can ascertain that the set of artifacts used for the analytics is shared. This enables the sequence underlying the successive stages of research and/or policymaking to be preserved.

This suggests that, in turn, and ex post, it becomes possible to test whether evidence supporting certain findings and/or policy decisions still holds. Moreover, unlike traditional databases, blockchain technology makes it possible to store immutable records.
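The link between immutable records and ex post verification can be sketched without any blockchain client. The example below is not the authors' smart-contract proof of concept. It is a plain in-memory hash chain, used here as a stand-in, in which each record stores the SHA-256 digest of an artifact and the hash of the previous record. The function names and the artifact contents are made up for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumption: fixed placeholder hash before the first record


def artifact_digest(content: bytes) -> str:
    """Content hash that identifies an artifact (data file, script, result)."""
    return hashlib.sha256(content).hexdigest()


def _record_hash(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(chain, artifact_name, content):
    """Append a record that points at the previous record's hash."""
    prev_hash = chain[-1]["record_hash"] if chain else GENESIS
    body = {"artifact": artifact_name, "digest": artifact_digest(content), "prev": prev_hash}
    chain.append({**body, "record_hash": _record_hash(body)})
    return chain


def verify(chain):
    """Return True only if no record has been altered or reordered."""
    prev_hash = GENESIS
    for record in chain:
        body = {key: record[key] for key in ("artifact", "digest", "prev")}
        if record["prev"] != prev_hash or record["record_hash"] != _record_hash(body):
            return False
        prev_hash = record["record_hash"]
    return True


# Made-up artifacts: a data extract followed by the script that analyses it.
chain = []
append_record(chain, "data_extract.csv", b"placeholder data\n")
append_record(chain, "analysis.py", b"print('placeholder analysis')\n")
print(verify(chain))  # the untouched chain verifies

chain[0]["digest"] = artifact_digest(b"tampered")
print(verify(chain))  # the altered record no longer matches its stored hash
```

On a public blockchain the record hashes would be anchored in transactions that others can inspect. The sketch shows only the tamper-evidence property that makes later verification possible.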

This means that the artifacts can be used for further exploitation or repetition of results. In practical terms, the use of blockchain technology creates the opportunity to enhance the evidence-based approach to policy design and policy recommendations that the OECD fosters.

That is, it might enable the stakeholders not only to use the data available in the OECD repositories but also to assess corrections to a given policy strategy or modify its scope.

Research limitations/implications

Blockchains and related technologies are still maturing, and several questions related to their use and potential remain underexplored. Several issues require particular consideration in future research, including anonymity, scalability and stability of the data repository.

This research took OECD data repositories as an example, precisely to make the point that more research and more dialogue between the research and policymaking communities are needed to embrace the challenges and opportunities that blockchain technology generates.

Several questions that this research prompts have not been addressed. For instance, the question of how the sharing economy concept for the specifics of the case could be employed in the context of blockchain has not been dealt with.

Practical implications

The practical implications of the research presented here can be summarized in two ways. On the one hand, by suggesting how a combination of decentralized file systems and an open blockchain, such as Ethereum supporting smart contracts, can ascertain that artifacts are shared, this paper paves the way toward a discussion on how to make this approach and solution a reality.

On the other hand, the approach and architecture proposed in this paper would provide a way to increase the scope of the reuse of statistical data and results and thus would improve the effectiveness of decision making as well as the transparency of the evidence supporting policy.

Social implications

Decentralizing analytic artifacts will add to existing open data practices an additional layer of benefits for different actors, including but not limited to policymakers, journalists, analysts and/or researchers without the need to establish centrally managed institutions.

Moreover, due to the degree of decentralization and the absence of a single entry point, the vulnerability of data repositories to cyberthreats might be reduced. Simultaneously, by ensuring that artifacts derived from data held in those distributed repositories are made immutable therein, full reproducibility of conclusions concerning the data is possible.

In the field of data-driven policymaking processes, it might allow policymakers to devise more accurate ways of addressing pressing issues and challenges.

Originality/value

This paper offers the first blueprint of a form of sharing that complements open data practices with the decentralized approach of blockchain and decentralized file systems.

The case of OECD data repositories is used to highlight that while data storage is important, the real added value of blockchain technology rests in the possible change in how we use the data and data sets in the repositories. It would eventually enable a more transparent and actionable approach to linking policy up with the supporting evidence.

From a different angle, throughout the paper the case is made that rather than simply data, artifacts from conducted analyses should be made persistent in a blockchain.

What is at stake is the full reproducibility of conclusions based on a given set of data, coupled with the possibility of ex post testing the validity of the assumptions and evidence underlying those conclusions.


Data Discovery Paradigms: User Requirements and Recommendations for Data Repositories

Authors : Mingfang Wu, Fotis Psomopoulos, Siri Jodha Khalsa, Anita de Waard

As data repositories make more data openly available, it becomes challenging for researchers to find what they need, either from a repository or through web search engines.

This study attempts to investigate data users’ requirements and the role that data repositories can play in supporting data discoverability by meeting those requirements.

We collected 79 data discovery use cases (or data search scenarios), from which we derived nine functional requirements for data repositories through qualitative analysis.

We then applied usability heuristic evaluation and expert review methods to identify best practices that data repositories can implement to meet each functional requirement.

We propose the following ten recommendations for data repository operators to consider for improving data discoverability and users’ data search experience:

1. Provide a range of query interfaces to accommodate various data search behaviours.

2. Provide multiple access points to find data.

3. Make it easier for researchers to judge relevance, accessibility and reusability of a data collection from a search summary.

4. Make individual metadata records readable and analysable.

5. Enable sharing and downloading of bibliographic references.

6. Expose data usage statistics.

7. Strive for consistency with other repositories.

8. Identify and aggregate metadata records that describe the same data object.

9. Make metadata records easily indexed and searchable by major web search engines.

10. Follow API search standards and community adopted vocabularies for interoperability.
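
Recommendation 9 can be illustrated with a short sketch. One common way to make a landing page indexable by web search engines is to embed the metadata as schema.org `Dataset` markup in JSON-LD. The study does not prescribe this particular mechanism, so the example is an assumption, and the record, its field names and the DOI are hypothetical placeholders.

```python
import json


def to_schema_org(record):
    """Map a repository metadata record to a schema.org Dataset description."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["description"],
        "identifier": record["doi"],
        "keywords": record["keywords"],
        "license": record["license"],
    }


# Hypothetical metadata record; none of these values refer to a real dataset.
record = {
    "title": "Example soil moisture measurements",
    "description": "Placeholder description of an example data collection.",
    "doi": "https://doi.org/10.1234/example",
    "keywords": ["soil moisture", "example"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

json_ld = json.dumps(to_schema_org(record), indent=2)
# The script element below would be placed in the dataset's landing page.
print(f'<script type="application/ld+json">\n{json_ld}\n</script>')
```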


Facilitating and Improving Environmental Research Data Repository Interoperability

Authors : Corinna Gries, Amber Budden, Christine Laney, Margaret O’Brien, Mark Servilla, Wade Sheldon, Kristin Vanderbilt, David Vieglais

Environmental research data repositories provide much needed services for data preservation and data dissemination to diverse communities with domain specific or programmatic data needs and standards.

Having been developed independently, these repositories serve their communities well but use different technologies, data models and ontologies. Hence, the effectiveness and efficiency of these services can be vastly improved if repositories work together, adhering to a shared community platform that focuses on the implementation of agreed-upon standards and best practices for curation and dissemination of data.

Such a community platform drives forward the convergence of technologies and practices that will advance cross-domain interoperability. It will also facilitate contributions from investigators through standardized and streamlined workflows and provide increased visibility for the role of data managers and the curation services provided by data repositories, beyond preservation infrastructure.

Ten specific suggestions for such standardizations are outlined without any suggestions for priority or technical implementation. Although the recommendations are for repositories to implement, they have been chosen specifically with the data provider/data curator and synthesis scientist in mind.

URL : Facilitating and Improving Environmental Research Data Repository Interoperability


A Data-Driven Approach to Appraisal and Selection at a Domain Data Repository

Authors : Amy M Pienta, Dharma Akmon, Justin Noble, Lynette Hoelter, Susan Jekielek

Social scientists are producing an ever-expanding volume of data, leading to questions about appraisal and selection of content given finite resources to process data for reuse. We analyze users’ search activity in an established social science data repository to better understand demand for data and more effectively guide collection development.

By applying a data-driven approach, we aim to ensure curation resources are applied to make the most valuable data findable, understandable, accessible, and usable. We analyze data from a domain repository for the social sciences that includes over 500,000 annual searches in 2014 and 2015 to better understand trends in user search behavior.

Using a newly created search-to-study ratio technique, we identified gaps in the domain data repository’s holdings and leveraged this analysis to inform our collection and curation practices and policies.
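
The abstract does not define the search-to-study ratio. As a rough sketch, assume it is the number of searches for a subject term divided by the number of studies in the holdings that match that term, so that a high ratio points to demand the collection does not meet. The function name and all counts below are made up for illustration.

```python
def search_to_study_ratios(search_counts, study_counts):
    """Rank terms by searches per available study, highest ratio first.

    Assumption: terms searched for but absent from the holdings get an
    infinite ratio, since any demand is unmet.
    """
    ratios = {}
    for term, searches in search_counts.items():
        studies = study_counts.get(term, 0)
        ratios[term] = searches / studies if studies else float("inf")
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))


# Hypothetical counts, not figures from the repository studied.
searches = {"immigration": 1200, "health": 3000, "social media": 450}
studies = {"immigration": 12, "health": 300}

ranked = search_to_study_ratios(searches, studies)
for term, ratio in ranked.items():
    print(f"{term}: {ratio}")
```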

The evaluative technique we propose in this paper will serve as a baseline for future studies looking at trends in user demand over time at the domain data repository being studied with broader implications for other data repositories.

URL : A Data-Driven Approach to Appraisal and Selection at a Domain Data Repository


The Changing Influence of Journal Data Sharing Policies on Local RDM Practices

Authors : Dylanne Dearborn, Steve Marks, Leanne Trimble

The purpose of this study was to examine changes in research data deposit policies of highly ranked journals in the physical and applied sciences between 2014 and 2016, as well as to develop an approach to examining the institutional impact of deposit requirements.

Policies from the top ten journals (ranked by impact factor from the Journal Citation Reports) were examined in 2014 and again in 2016 in order to determine if data deposits were required or recommended, and which methods of deposit were listed as options.

For all 2016 journals with a required data deposit policy, publication information (2009-2015) for the University of Toronto was pulled from Scopus and departmental affiliation was determined for each article.

The results showed that the number of high-impact journals in the physical and applied sciences requiring data deposit is growing. In 2014, 71.2% of journals had no policy, 14.7% had a recommended policy, and 13.9% had a required policy (n=836).

In contrast, in 2016, 58.5% had no policy, 19.4% had a recommended policy, and 22.0% had a required policy (n=880). It was also evident that U of T chemistry researchers are by far the most heavily affected by these journal data deposit requirements: they account for 543 publications, or 32.7% of all publications in the titles requiring data deposit in 2016.
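
The shift between the two survey years is easiest to read as percentage-point changes. The short calculation below uses only the figures reported above. The implied total is a back-of-the-envelope estimate derived from the rounded 32.7% share and is not a number stated by the authors.

```python
# Shares of journals by policy type, as reported for 2014 and 2016 (percent).
policy_2014 = {"none": 71.2, "recommended": 14.7, "required": 13.9}
policy_2016 = {"none": 58.5, "recommended": 19.4, "required": 22.0}

# Percentage-point change per policy type between the two years.
change = {kind: round(policy_2016[kind] - policy_2014[kind], 1) for kind in policy_2014}
print(change)

# Rough estimate of all U of T publications in the titles requiring deposit,
# assuming the 543 chemistry publications are 32.7% of that total.
implied_total = round(543 / 0.327)
print(implied_total)
```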

The Python scripts used to retrieve institutional publications based on a list of ISSNs have been released on GitHub so that other institutions can conduct similar research.

URL : The Changing Influence of Journal Data Sharing Policies on Local RDM Practices


A Research Graph dataset for connecting research data repositories using RD-Switchboard

Authors : Amir Aryani, Marta Poblet, Kathryn Unsworth, Jingbo Wang, Ben Evans, Anusuriya Devaraju, Brigitte Hausstein, Claus-Peter Klas, Benjamin Zapilko, Samuele Kaplun

This paper describes the open access graph dataset that shows the connections between Dryad, CERN, ANDS and other international data repositories to publications and grants across multiple research data infrastructures.

The graph dataset was created using the Research Graph data model and the Research Data Switchboard (RD-Switchboard), a collaborative project by the Research Data Alliance DDRI Working Group (DDRI WG) that aims to discover and connect related research datasets based on publication co-authorship or jointly funded grants.

The graph dataset allows researchers to trace and follow the paths to understanding a body of work. By mapping the links between research datasets and related resources, the graph dataset improves both their discovery and visibility, while avoiding duplicate efforts in data creation.
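
Tracing such a path can be sketched with a small undirected graph and a breadth-first search. This is not the RD-Switchboard implementation or the Research Graph schema. The node labels and edges are hypothetical and only mimic the kinds of links described (dataset, publication, researcher, grant).

```python
from collections import deque

# Hypothetical links between research objects.
edges = [
    ("dataset:A", "publication:P1"),
    ("publication:P1", "researcher:R1"),
    ("researcher:R1", "publication:P2"),
    ("publication:P2", "dataset:B"),
    ("dataset:C", "grant:G1"),
]


def build_graph(edge_list):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for left, right in edge_list:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


graph = build_graph(edges)
# Two datasets connected through a shared researcher's publications.
print(shortest_path(graph, "dataset:A", "dataset:B"))
# No chain of links connects these two datasets.
print(shortest_path(graph, "dataset:A", "dataset:C"))
```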

Ultimately, the linked datasets may spur novel ideas, facilitate reproducibility and re-use in new applications, stimulate combinatorial creativity, and foster collaborations across institutions.

URL : A Research Graph dataset for connecting research data repositories using RD-Switchboard
