Formalizing Privacy Laws for License Generation and Data Repository Decision Automation

Authors : Micah Altman, Stephen Chong, Alexandra Wood

In this paper, we summarize work-in-progress on expert system support to automate some data deposit and release decisions within a data repository, and to generate custom license agreements for those data transfers.

Our approach formalizes via a logic programming language the privacy-relevant aspects of laws, regulations, and best practices, supported by legal analysis documented in legal memoranda.

This formalization enables automated reasoning about the conditions under which a repository can transfer data, through interrogation of users, and the application of formal rules to the facts obtained from users.
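
The paper formalizes its rules in a logic programming language; as a rough, hypothetical illustration of the rules-plus-facts reasoning described above, a toy decision function in Python might look like the following (the predicates, conditions, and license clauses are invented for this sketch, not taken from the authors' system):

```python
# Hypothetical rules-over-facts sketch of an automated data-release decision.
# All predicate names and conditions are invented for illustration only.

def may_release(facts: dict) -> bool:
    """Apply simple formal rules to facts gathered by interrogating a depositor."""
    if facts.get("contains_phi") and not facts.get("hipaa_deidentified"):
        return False  # identifiable health data without de-identification
    if facts.get("contains_ferpa_records") and not facts.get("student_consent"):
        return False  # education records require consent under this toy rule set
    return True

def license_terms(facts: dict) -> list:
    """Derive custom license clauses from the same facts."""
    terms = ["no_reidentification"]
    if facts.get("contains_phi"):
        terms.append("no_linkage_with_external_datasets")
    return terms

decision = may_release({"contains_phi": True, "hipaa_deidentified": True})
```

Because the same fact base drives both the release decision and the generated clauses, the resulting agreement stays consistent with the rationale for the decision, which is the transparency property the paper aims for.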

The proposed system takes the specific conditions for a given data release and produces a custom data use agreement that accurately captures the relevant restrictions on data use.

This enables appropriate decisions and accurate licenses, while removing the bottleneck of lawyer effort per data transfer.

The operation of the system aims to be transparent, in the sense that administrators, lawyers, institutional review boards, and other interested parties can evaluate the legal reasoning and interpretation embodied in the formalization, and the specific rationale for a decision to accept or release a particular dataset.

URL : https://arxiv.org/abs/1910.10096

A Discussion of Value Metrics for Data Repositories in Earth and Environmental Sciences

Authors : Cynthia Parr, Corinna Gries, Margaret O’Brien, Robert R. Downs, Ruth Duerr, Rebecca Koskela, Philip Tarrant, Keith E. Maull, Nancy Hoebelheinrich, Shelley Stall

Despite growing recognition of the importance of public data to the modern economy and to scientific progress, long-term investment in the repositories that manage and disseminate scientific data in easily accessible ways remains elusive. Repositories are asked to demonstrate the net value of their data and services to justify continued funding or attract new funding sources.

Here, representatives from a number of environmental and Earth science repositories evaluate approaches for assessing the costs and benefits of publishing scientific data in their repositories, identifying various metrics that repositories typically use to report on the impact and value of their data products and services, plus additional metrics that would be useful but are not typically measured.

We rated each metric by (a) the difficulty of implementation by our specific repositories and (b) its importance for value determination. As managers of environmental data repositories, we find that some of the most easily obtainable data-use metrics (such as data downloads and page views) may be less indicative of value than metrics that relate to discoverability and broader use.

Other intangible but equally important metrics (e.g., laws or regulations impacted, lives saved, new proposals generated) will require considerable additional research to describe and develop, plus resources to implement at scale.

As value can only be determined from the point of view of a stakeholder, it is likely that multiple sets of metrics will be needed, tailored to specific stakeholder needs. Moreover, economically based analyses or the use of specialists in the field are expensive and can happen only as resources permit.

DOI : http://doi.org/10.5334/dsj-2019-058

Workflows Allowing Creation of Journal Article Supporting Information and Findable, Accessible, Interoperable, and Reusable (FAIR)-Enabled Publication of Spectroscopic Data

Authors : Agustin Barba, Santiago Dominguez, Carlos Cobas, David P. Martinsen, Charles Romain, Henry S. Rzepa, Felipe Seoane

There is an increasing focus on the part of academic institutions, funding agencies, and publishers, if not researchers themselves, on preservation and sharing of research data. Motivations for sharing include research integrity, replicability, and reuse.

One of the barriers to publishing data is the extra work involved in preparing data for publication once a journal article and its supporting information have been completed.

In this work, a method is described to generate both human- and machine-readable supporting information directly from the primary instrumental data files and to generate the metadata to ensure it is published in accordance with findable, accessible, interoperable, and reusable (FAIR) guidelines.

Using this approach, both the human-readable supporting information and the primary (raw) data can be submitted simultaneously with little extra effort.

Although traditionally the data package would be sent to a journal publisher for publication alongside the article, the data package could also be published independently in an institutional FAIR data repository.

Workflows are described that store the data packages and generate metadata appropriate for such a repository. The methods both to generate and to publish the data packages have been implemented for NMR data, but the concept is extensible to other types of spectroscopic data as well.
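
As a loose illustration of what such a machine-readable data package might contain, the sketch below bundles a list of primary NMR data files with descriptive metadata serialized as JSON. The field names follow common DataCite-style conventions and are assumptions for this example, not the schema used by the workflow in the paper:

```python
import json

# Illustrative sketch of pairing primary NMR data files with FAIR-oriented
# metadata. Field names and values are placeholders, not the paper's schema.

def build_metadata(identifier: str, title: str, data_files: list) -> str:
    record = {
        "identifier": identifier,          # findable: persistent identifier
        "title": title,
        "resourceType": "Dataset",
        "formats": sorted({f.rsplit(".", 1)[-1] for f in data_files}),
        "files": data_files,               # accessible: enumerated contents
        "license": "CC-BY-4.0",            # reusable: explicit license
    }
    return json.dumps(record, indent=2, sort_keys=True)

package = build_metadata(
    "10.5072/example.nmr.1",               # 10.5072 is the DOI test prefix
    "1H NMR spectra of an example compound",
    ["spectrum.fid", "spectrum.jdx"],
)
```

Generating this record directly from the instrument files, as the paper proposes, removes the separate data-preparation step that is the main barrier to deposit.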

DOI : https://doi.org/10.1021/acsomega.8b03005

Are Research Datasets FAIR in the Long Run?

Authors : Dennis Wehrle, Klaus Rechert

Currently, initiatives in Germany are developing infrastructure to accept and preserve dissertation data together with the dissertation texts (on state level – bwDATA Diss, on federal level – eDissPlus).

In contrast to specialized data repositories, these services will accept data from all kinds of research disciplines. To ensure FAIR data principles (Wilkinson et al., 2016), preservation plans are required, because ensuring accessibility, interoperability and re-usability even for a minimum ten-year data retention period can become a major challenge.

Both for longevity and re-usability, file formats matter. In order to ensure access to data, the data's encoding, i.e. their technical and structural representation in the form of file formats, needs to be understood. Hence, due to a fast technical lifecycle, interoperability, re-use and in some cases even accessibility depend on the data's format and our future ability to parse or render those formats.

This leads to several practical questions regarding quality assurance, potential access options and necessary future preservation steps. In this paper, we analyze datasets from public repositories and apply a file format based long-term preservation risk model to support workflows and services for non-domain specific data repositories.
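
A minimal sketch of what a file-format-based preservation risk model could look like is given below. The format categories and scores are invented placeholders for illustration, not the risk model developed in the paper:

```python
# Illustrative file-format risk scoring. Categories and scores are invented
# assumptions, not the model from the paper.

OPEN_WELL_DOCUMENTED = {"csv", "txt", "xml", "pdf"}   # low preservation risk
PROPRIETARY_COMMON = {"xlsx", "docx", "mat"}          # medium risk
UNKNOWN_RISK = 3                                      # default: high risk

def format_risk(filename: str) -> int:
    """Return a 1 (low) to 3 (high) long-term preservation risk for a file."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext in OPEN_WELL_DOCUMENTED:
        return 1
    if ext in PROPRIETARY_COMMON:
        return 2
    return UNKNOWN_RISK

def dataset_risk(files: list) -> int:
    """A dataset is only as preservable as its riskiest file."""
    return max(format_risk(f) for f in files)

risk = dataset_risk(["readme.txt", "results.xlsx", "model.sav"])
```

A repository could run such a check at ingest time to flag deposits that will need format migration or extra documentation before the end of the retention period.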

DOI : https://doi.org/10.2218/ijdc.v13i1.659

Blockchain and OECD data repositories: opportunities and policymaking implications

Authors : Miguel-Angel Sicilia, Anna Visvizi

Purpose

The purpose of this paper is to employ the case of Organization for Economic Cooperation and Development (OECD) data repositories to examine the potential of blockchain technology in the context of addressing basic contemporary societal concerns, such as transparency, accountability and trust in the policymaking process. Current approaches to sharing data employ standardized metadata, in which the provider of the service is assumed to be a trusted party.

However, derived data, analytic processes and links to policies are in many cases not shared in the same form, thus breaking the provenance trace and making it difficult to repeat analyses conducted in the past. Similarly, it becomes difficult to test whether the conditions that justified an implemented policy still apply.

A higher level of reuse would require a decentralized approach to sharing both data and analytic scripts and software. This could be supported by a combination of blockchain and decentralized file system technology.

Design/methodology/approach

The findings presented in this paper have been derived from an analysis of a case study, i.e., analytics using data made available by the OECD. The set of data the OECD provides is vast and is used broadly.

The argument is structured as follows. First, current issues and topics shaping the debate on blockchain are outlined. Then, the main artifacts on which simple or more elaborate analytic results are based are redefined for some concrete purposes.

The requirements on provenance, trust and repeatability are discussed with regards to the architecture proposed, and a proof of concept using smart contracts is used for reasoning on relevant scenarios.

Findings

A combination of decentralized file systems and an open blockchain such as Ethereum supporting smart contracts can ascertain that the set of artifacts used for the analytics is shared. This enables the sequence underlying the successive stages of research and/or policymaking to be preserved.
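
The core mechanism, registering content hashes of shared artifacts so they can be verified later, can be sketched as follows. In a real deployment the digests would be anchored in an Ethereum smart contract and the bytes kept in a decentralized file system such as IPFS; in this simplified Python stand-in, an append-only list plays the role of the chain:

```python
import hashlib

# Simplified sketch of content-addressed artifact registration.
# An append-only list stands in for an immutable on-chain record; a real
# system would write these digests via an Ethereum smart contract.

ledger: list = []

def register_artifact(content: bytes, kind: str) -> str:
    """Record a content hash so later analyses can verify provenance."""
    digest = hashlib.sha256(content).hexdigest()
    ledger.append({"kind": kind, "sha256": digest})
    return digest

def verify_artifact(content: bytes, digest: str) -> bool:
    """Ex post check that an artifact still matches its registered hash."""
    return hashlib.sha256(content).hexdigest() == digest

data_digest = register_artifact(b"year,gdp\n2017,1.8\n", kind="dataset")
script_digest = register_artifact(b"print('analysis')", kind="script")
```

Because both the dataset and the analytic script are registered, anyone can later confirm that a policy conclusion was derived from exactly these artifacts, which is the reproducibility property the findings describe.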

This suggests that, in turn, and ex post, it becomes possible to test whether the evidence supporting certain findings and/or policy decisions still holds. Moreover, unlike traditional databases, blockchain technology makes it possible to store immutable records.

This means that the artifacts can be used for further exploitation or repetition of results. In practical terms, the use of blockchain technology creates the opportunity to enhance the evidence-based approach to policy design and policy recommendations that the OECD fosters.

That is, it might enable the stakeholders not only to use the data available in the OECD repositories but also to assess corrections to a given policy strategy or modify its scope.

Research limitations/implications

Blockchains and related technologies are still maturing, and several questions related to their use and potential remain underexplored. Several issues require particular consideration in future research, including anonymity, scalability and stability of the data repository.

This research took OECD data repositories as its example precisely to make the point that more research, and more dialogue between the research and policymaking communities, is needed to embrace the challenges and opportunities that blockchain technology generates.

Several questions that this research prompts have not been addressed. For instance, the question of how the sharing economy concept could be applied to the specifics of this case in the context of blockchain has not been dealt with.

Practical implications

The practical implications of the research presented here can be summarized in two ways. On the one hand, by suggesting how a combination of decentralized file systems and an open blockchain, such as Ethereum supporting smart contracts, can ascertain that artifacts are shared, this paper paves the way toward a discussion on how to make this approach and solution reality.

The approach and architecture proposed in this paper would provide a way to increase the scope of the reuse of statistical data and results and thus would improve the effectiveness of decision making as well as the transparency of the evidence supporting policy.

Social implications

Decentralizing analytic artifacts will add to existing open data practices an additional layer of benefits for different actors, including but not limited to policymakers, journalists, analysts and/or researchers without the need to establish centrally managed institutions.

Moreover, due to the degree of decentralization and the absence of a single entry point, the vulnerability of data repositories to cyberthreats might be reduced. Simultaneously, by ensuring that artifacts derived from data held in those distributed repositories are made immutable therein, full reproducibility of conclusions concerning the data is possible.

In the field of data-driven policymaking processes, it might allow policymakers to devise more accurate ways of addressing pressing issues and challenges.

Originality/value

This paper offers the first blueprint of a form of sharing that complements open data practices with the decentralized approach of blockchain and decentralized file systems.

The case of OECD data repositories is used to highlight that while data storage is important, the real added value of blockchain technology lies in how it could change the way we use the data and datasets in the repositories. It could eventually enable a more transparent and actionable approach to linking policy with the supporting evidence.

From a different angle, throughout the paper the case is made that rather than simply data, artifacts from conducted analyses should be made persistent in a blockchain.

What is at stake is the full reproducibility of conclusions based on a given set of data, coupled with the possibility of ex post testing the validity of the assumptions and evidence underlying those conclusions.

DOI : https://doi.org/10.1108/LHT-12-2017-0276

Data Discovery Paradigms: User Requirements and Recommendations for Data Repositories

Authors : Mingfang Wu, Fotis Psomopoulos, Siri Jodha Khalsa, Anita de Waard

As data repositories make more data openly available it becomes challenging for researchers to find what they need either from a repository or through web search engines.

This study attempts to investigate data users’ requirements and the role that data repositories can play in supporting data discoverability by meeting those requirements.

We collected 79 data discovery use cases (or data search scenarios), from which we derived nine functional requirements for data repositories through qualitative analysis.

We then applied usability heuristic evaluation and expert review methods to identify best practices that data repositories can implement to meet each functional requirement.

We propose the following ten recommendations for data repository operators to consider for improving data discoverability and users’ data search experience:

1. Provide a range of query interfaces to accommodate various data search behaviours.

2. Provide multiple access points to find data.

3. Make it easier for researchers to judge relevance, accessibility and reusability of a data collection from a search summary.

4. Make individual metadata records readable and analysable.

5. Enable sharing and downloading of bibliographic references.

6. Expose data usage statistics.

7. Strive for consistency with other repositories.

8. Identify and aggregate metadata records that describe the same data object.

9. Make metadata records easily indexed and searchable by major web search engines.

10. Follow API search standards and community adopted vocabularies for interoperability.
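
Recommendation 9 is commonly met in practice by embedding schema.org Dataset markup in dataset landing pages, which major web search engines index. A minimal illustrative record, built here in Python, is shown below; all values are placeholders, not a real dataset:

```python
import json

# Minimal schema.org/Dataset JSON-LD record of the kind web search engines
# index (recommendation 9). All values are placeholders.

def dataset_jsonld(name: str, description: str, doi_url: str) -> str:
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": doi_url,
        "license": "https://creativecommons.org/licenses/by/4.0/",
    }
    return json.dumps(record, indent=2)

markup = dataset_jsonld(
    "Example stream temperature observations",
    "Hourly temperature readings from a hypothetical sensor network.",
    "https://doi.org/10.5072/example",
)
```

Embedding such a block in a landing page serves several of the recommendations at once: it makes the record indexable (9), machine-analysable (4), and gives searchers the name, description, identifier and license needed to judge relevance and reusability (3).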

DOI : http://doi.org/10.5334/dsj-2019-003

Facilitating and Improving Environmental Research Data Repository Interoperability

Authors : Corinna Gries, Amber Budden, Christine Laney, Margaret O’Brien, Mark Servilla, Wade Sheldon, Kristin Vanderbilt, David Vieglais

Environmental research data repositories provide much needed services for data preservation and data dissemination to diverse communities with domain specific or programmatic data needs and standards.

Because they were developed independently, these repositories serve their communities well, but they use different technologies, data models and ontologies. Hence, the effectiveness and efficiency of these services can be vastly improved if repositories work together, adhering to a shared community platform that focuses on implementing agreed-upon standards and best practices for the curation and dissemination of data.

Such a community platform drives forward the convergence of technologies and practices that will advance cross-domain interoperability. It will also facilitate contributions from investigators through standardized and streamlined workflows and provide increased visibility for the role of data managers and the curation services provided by data repositories, beyond preservation infrastructure.

Ten specific suggestions for such standardization are outlined, without any ranking by priority or prescriptions for technical implementation. Although the recommendations are for repositories to implement, they have been chosen specifically with the data provider/data curator and the synthesis scientist in mind.

DOI : http://doi.org/10.5334/dsj-2018-022