Amplifying Data Curation Efforts to Improve the Quality of Life Science Data

Authors : Mariam Alqasab, Suzanne M. Embury, Sandra de F. Mendes Sampaio

In the era of data science, datasets are shared widely and used for many purposes unforeseen by the original creators of the data. In this context, defects in datasets can have far-reaching consequences, spreading from dataset to dataset, and affecting the consumers of data in ways that are hard to predict or quantify.

Some form of waste is often the result. For example, scientists using defective data to propose hypotheses for experimentation may waste their limited wet lab resources chasing the wrong experimental targets. Scarce drug trial resources may be used to test drugs that actually have little chance of giving a cure.

Because of the potential real-world costs, database owners care about providing high-quality data. Automated curation tools can be used to an extent to discover and correct some forms of defect.

However, in some areas human curation, performed by highly trained domain experts, is needed to ensure that the data represents our current interpretation of reality accurately.

Human curators are expensive, and there is far more curation work to be done than there are curators available to perform it. Tools and techniques are needed to enable the full value to be obtained from the curation effort currently available.

In this paper, we explore one possible approach to maximising the value obtained from human curators, by automatically extracting information about data defects and corrections from the work that the curators do.

This information is packaged in a source-independent form, to allow it to be used by the owners of other databases (for which human curation effort is not available or is insufficient).

This amplifies the efforts of the human curators, allowing their work to be applied to other sources, without requiring any additional effort or change in their processes or tool sets. We show that this approach can discover significant numbers of defects, which can also be found in other sources.

URL : Amplifying Data Curation Efforts to Improve the Quality of Life Science Data


Connecting Data Publication to the Research Workflow: A Preliminary Analysis

Authors : Sünje Dallmeier-Tiessen, Varsha Khodiyar, Fiona Murphy, Amy Nurnberger, Lisa Raymond, Angus Whyte

The data curation community has long encouraged researchers to document collected research data during active stages of the research workflow, to provide robust metadata earlier, and support research data publication and preservation.

Data documentation with robust metadata is one of a number of steps in effective data publication. Data publication is the process of making digital research objects ‘FAIR’, i.e. findable, accessible, interoperable, and reusable: attributes increasingly expected by research communities, funders and society.

Research data publishing workflows are the means to that end. Currently, however, much published research data remains inconsistently and inadequately documented by researchers.

Documentation of data closer in time to data collection would help mitigate the high cost that repositories associate with the ingest process. More effective data publication and sharing should in principle result from early interactions between researchers and their selected data repository.

This paper describes a short study undertaken by members of the Research Data Alliance (RDA) and World Data System (WDS) working group on Publishing Data Workflows. We present a collection of recent examples of data publication workflows that connect data repositories and publishing platforms with research activity ‘upstream’ of the ingest process.

We re-articulate previous recommendations of the working group, to account for the varied upstream service components and platforms that support the flow of contextual and provenance information downstream.

These workflows should be open and loosely coupled to support interoperability, including with preservation and publication environments. Our recommendations aim to stimulate further work on researchers’ views of data publishing and the extent to which available services and infrastructure facilitate the publication of FAIR data.

We also aim to stimulate further dialogue about, and definition of, the roles and responsibilities of research data services and platform providers for the ‘FAIRness’ of research data publication workflows themselves.

URL : Connecting Data Publication to the Research Workflow: A Preliminary Analysis


Rethinking Data Sharing and Human Participant Protection in Social Science Research: Applications from the Qualitative Realm

Authors : Dessi Kirilova, Sebastian Karcher

While data sharing is becoming increasingly common in quantitative social inquiry, qualitative data are rarely shared. One factor inhibiting data sharing is a concern about human participant protections and privacy.

Protecting the confidentiality and safety of research participants is a concern for both quantitative and qualitative researchers, but it raises specific concerns within the epistemic context of qualitative research.

Thus, the applicability of emerging protection models from the quantitative realm must be carefully evaluated for application to the qualitative realm. At the same time, qualitative scholars already employ a variety of strategies for human-participant protection implicitly or informally during the research process.

In this practice paper, we assess available strategies for protecting human participants and how they can be deployed. We describe a spectrum of possible data management options, such as de-identification and applying access controls, including some already employed by the Qualitative Data Repository (QDR) in tandem with its pilot depositors.

Throughout the discussion, we consider the tension between modifying data or restricting access to them, and retaining their analytic value.

We argue that developing explicit guidelines for sharing qualitative data generated through interaction with humans will allow scholars to address privacy concerns and increase the secondary use of their data.

URL : Rethinking Data Sharing and Human Participant Protection in Social Science Research: Applications from the Qualitative Realm



What do data curators care about? Data quality, user trust, and the data reuse plan

Author : Frank Andreas Sposito

Data curation is often defined as the practice of maintaining, preserving, and enhancing research data for long-term value and reusability. The role of data reuse in the data curation lifecycle is critical: increased reuse is the core justification for the often sizable expenditures necessary to build data management infrastructures and user services.

Yet recent studies have shown that data are being shared and reused through open data repositories at much lower levels than expected. These studies underscore a fundamental and often overlooked challenge in research data management that invites deeper examination of the roles and responsibilities of data curators.

This presentation will identify two key barriers to data reuse, namely data quality and user trust, and propose a framework for implementing reuser-centric strategies to increase data reuse.

Using the concept of a ‘data reuse plan’, it will highlight repository-based approaches to improve data quality and user trust, and address critical areas for innovation for data curators working in the absence of repository support.

URL : What do data curators care about? Data quality, user trust, and the data reuse plan


Strengthening institutional data management and promoting data sharing in the social and economic sciences

Authors : Monika Linne, Wolfgang Zenk-Möltgen

In the German social and economic sciences there is a growing awareness of flexible data distribution and research data reuse, especially as increasing numbers of research funders recommend publishing research data as the basis for scientific insight.

However, a data-sharing mentality has not yet been established in Germany, owing to researchers’ strong reservations about publishing their data.

This attitude is exacerbated by the fact that, at present, there is no trusted national data sharing repository that covers the particular requirements of institutions regarding research data.

This article discusses how this objective can be achieved with the project initiative SowiDataNet.

The development of a community-driven data repository is a logically consistent and important step towards an attitude shift concerning data sharing in the social and economic sciences.


Developments in research data management in academic libraries: Towards an understanding of research data service maturity

Authors : Andrew M. Cox, Mary Anne Kennan, Liz Lyon, Stephen Pinfield

This paper reports an international study of research data management (RDM) activities, services and capabilities in higher education libraries. It presents the results of a survey covering higher education libraries in Australia, Canada, Germany, Ireland, the Netherlands, New Zealand and the UK.

The results indicate that libraries have provided leadership in RDM, particularly in advocacy and policy development. Service development is still limited, focused especially on advisory and consultancy services (such as data management planning support and data-related training), rather than technical services (such as provision of a data catalogue, and curation of active data).

Data curation skills development is underway in libraries, but skills and capabilities are not consistently in place and remain a concern. Other major challenges include resourcing, working with other support services, and achieving ‘buy in’ from researchers and senior managers.

Results are compared with previous studies in order to assess trends and relative maturity levels. The range of RDM activities explored in this study is positioned on a ‘landscape maturity model’, which reflects current and planned research data services and practice in academic libraries, representing a ‘snapshot’ of current developments and a baseline for future research.


Collaboration to Data Curation: Harnessing Institutional Expertise

It can be argued that institutional repositories have not had the impact initially expected on academic scholarly communications (Lynch 2003; Salo 2008), the exception being a few well-developed and successful instances.

So why should data repositories expect to fare any better? First, data repositories can learn from publication repositories’ experiences and their efforts to engage researchers to accept and use these new institutional services.

Second, they provide a technical infrastructure for storing and sharing data with the potential for providing access to complementary research support facilities. Finally, due to the interdisciplinary expertise required to develop and maintain such systems, stronger ties will be forged between libraries, information and computing services, and researchers.

This will assist innovation and help to make them sustainable and embedded within academic institutional policy.

This paper, while aware of the diverse nature of institutional and departmental practices, aims to highlight a number of initiatives in the Universities of Edinburgh and Oxford, showing how research data repository infrastructures can be effectively realized through collaboration and sharing of expertise.

We argue that by employing agile community, strategic and policy judgment, a robust data repository infrastructure will be part of an integrated solution to effectively manage institutional research data assets.