Open Data Protection : Study on legal barriers to open data sharing – Data Protection and PSI

Authors : Andreas Wiebe, Nils Dietrich

This study analyses legal barriers to data sharing in the context of the Open Research Data Pilot, which the European Commission is running within its research framework programme Horizon 2020.

In the first part of the study, data protection issues are analysed. After a brief overview of the international basis for data protection, the European legal framework is described in detail.

The main focus is thus on the Data Protection Directive (95/46/EC), which has been in force since 1995. Not only is the Data Protection Directive itself described, but also its implementation in selected EU Member States.

Additionally, the upcoming General Data Protection Regulation (2016/679/EU) and relevant changes are described. Special focus is placed on leading data protection principles. Next, the study describes the use of research data in the Open Research Data Pilot and how data protection principles influence such use.

The experiences of the European Commission in running the Open Research Data Pilot so far, as well as basic examples of repository use forms, are considered. The second part of the study analyses the extent to which legislation on public sector information (PSI) influences access to and re-use of research data.

The Public Sector Information Directive (2003/98/EC) and the impact of its revision in 2013 (2013/37/EU) are described. There is a special focus on the application of PSI legislation to public libraries, including university and research libraries, and its practical implications.

In the final part of the study the results are critically evaluated and core recommendations are made to improve the legal situation in relation to research data.

Recommendations to Improve Downloads of Large Earth Observation Data

Authors : Rahul Ramachandran, Christopher Lynnes, Kathleen Baynes, Kevin Murphy, Jamie Baker, Jamie Kinney, Ariel Gold, Jed Sundwall, Mark Korver, Allison Lieber, William Vambenepe, Matthew Hancher, Rebecca Moore, Tyler Erickson, Josh Henretig, Brant Zwiefel, Heather Patrick-Ahlstrom, Matthew J. Smith

With the volume of Earth observation data expanding rapidly, cloud computing is quickly changing the way these data are processed, analyzed, and visualized. Collocating freely available Earth observation data on a cloud computing infrastructure may create opportunities unforeseen by the original data provider for innovation and value-added data re-use, but existing systems at data centers are not designed for supporting requests for large data transfers.

Without a common methodology, each data center has to handle such requests from different cloud vendors differently. Guidelines are needed so that all cloud vendors can use a common methodology for bulk-downloading data from data centers, sparing providers from building custom capabilities to meet the needs of individual vendors.

This paper presents recommendations distilled from use cases provided by three cloud vendors (Amazon, Google, and Microsoft) and based on the vendors’ interactions with data systems at different Federal agencies and organizations.

These specific recommendations range from obvious steps for improving data usability (such as ensuring the use of standard data formats and commonly supported projections) to non-obvious undertakings important for enabling bulk data downloads at scale.

These recommendations can be used to evaluate and improve existing data systems for high-volume data transfers, and their adoption can lead to cloud vendors utilizing a common methodology.

Understanding Data Retrieval Practices: A Social Informatics Perspective

Authors : Kathleen Gregory, Helena Cousijn, Paul Groth, Andrea Scharnhorst, Sally Wyatt

Open research data are heralded as having the potential to increase effectiveness, productivity, and reproducibility in science, but little is known about the actual practices involved in data search and retrieval.

The socio-technical problem of locating data for (re)use is often reduced to the technological dimension of designing data search systems. In this article, we explore how a social informatics perspective can help to better analyze the current academic discourse about data retrieval as well as to study user practices and behaviors.

We employ two methods in our analysis – bibliometrics and interviews with data seekers – and conclude with a discussion of the implications of our findings for designing data discovery systems.


DataMed – an open source discovery index for finding biomedical datasets

Authors : Xiaoling Chen, Anupama E Gururaj, Burak Ozyurt, Ruiling Liu, Ergin Soysal, Trevor Cohen, Firat Tiryaki, Yueling Li, Nansu Zong, Min Jiang, Deevakar Rogith, Mandana Salimi, Hyeon-eui Kim, Philippe Rocca-Serra, Alejandra Gonzalez-Beltran, Claudiu Farcas, Todd Johnson, Ron Margolis, George Alter, Susanna-Assunta Sansone, Ian M Fore, Lucila Ohno-Machado, Jeffrey S Grethe, Hua Xu


Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain.

Materials and Methods

DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health–funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium.

It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries.

In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine.

Results and Conclusion

Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and that core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the fraction of relevant results in the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services.
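
To make the P@10 figure concrete, here is a minimal sketch of how precision at k is commonly computed for a single query. It is not taken from the DataMed code base; the function name, the document identifiers, and the relevance judgments are hypothetical and chosen only for illustration. A benchmark score such as the one reported above would typically be the mean of this value over all benchmark queries.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Return the fraction of the top-k ranked results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    # Count how many of the top-k results were judged relevant.
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes result lists shorter than k.
    return hits / k


# Hypothetical ranking returned for one query, best match first.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
# Hypothetical set of datasets judged relevant for that query.
relevant = {"d3", "d1", "d4", "d2", "d5", "d6"}

print(precision_at_k(ranked, relevant))  # 0.6
```

In this made-up example six of the ten returned datasets are relevant, so P@10 is 0.6.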

We have made the DataMed system publicly available as an open source package for the biomedical community.



Attitudes and norms affecting scientists’ data reuse

Authors : Renata Gonçalves Curty, Kevin Crowston, Alison Specht, Bruce W. Grant, Elizabeth D. Dalton

The value of sharing scientific research data is widely appreciated, but factors that hinder or prompt the reuse of data remain poorly understood. Using the Theory of Reasoned Action, we test the relationship between the beliefs and attitudes of scientists towards data reuse, and their self-reported data reuse behaviour.

To do so, we used existing responses to selected questions from a worldwide survey of scientists developed and administered by the DataONE Usability and Assessment Working Group (thus practicing data reuse ourselves).

Results show that the perceived efficacy and efficiency of data reuse are strong predictors of reuse behaviour, and that the perceived importance of data reuse corresponds to greater reuse. Contrary to our expectations, expressed lack of trust in existing data and perceived norms against data reuse were not found to be major impediments to reuse.

We found that reported use of models and remotely-sensed data was associated with greater reuse. The results suggest that data reuse would be encouraged and normalized by demonstration of its value.

We offer some theoretical and practical suggestions that could help to legitimize investment and policies in favor of data sharing.

Data-Sprinting: a Public Approach to Digital Research

Authors : Tommaso Venturini, Anders Munk, Axel Meunier

This chapter is about the politics of interdisciplinarity. Not in the sense of the research politics fostering collaboration across disciplines, but in the stronger sense of transcending disciplinary boundaries to make significant political contributions.

In short: it is about making research public. To address this question, this chapter introduces (through a concrete example in climate debate research) an original research format that we call data-sprinting.