Revisiting the Data Lifecycle with Big Data Curation

Author : Line Pouchard

As science becomes more data-intensive and collaborative, researchers increasingly use larger and more complex data to answer research questions.

The capacity of storage infrastructure, the increased sophistication and deployment of sensors, the ubiquitous availability of computer clusters, the development of new analysis techniques, and larger collaborations allow researchers to address grand societal challenges in a way that is unprecedented.

In parallel, research data repositories have been built to host research data in response to the requirements of sponsors that research data be publicly available. Libraries are re-inventing themselves to respond to a growing demand to manage, store, curate and preserve the data produced in the course of publicly funded research.

As librarians and data managers are developing the tools and knowledge they need to meet these new expectations, they inevitably encounter conversations around Big Data. This paper explores definitions of Big Data that have coalesced in the last decade around four commonly mentioned characteristics: volume, variety, velocity, and veracity.

We highlight the issues associated with each characteristic, particularly their impact on data management and curation. Using the data life cycle model as a methodological framework, we assess two models developed in the context of Big Data projects and find them lacking.

We propose a Big Data life cycle model that includes activities focused on Big Data and more closely integrates curation with the research life cycle. These activities include planning, acquiring, preparing, analyzing, preserving, and discovering, with describing the data and assuring quality being an integral part of each activity.

We discuss the relationship between institutional data curation repositories and new long-term data resources associated with high performance computing centers, and reproducibility in computational science.

We apply this model by mapping the four characteristics of Big Data outlined above to each of the activities in the model. This mapping produces a set of questions that practitioners should be asking in a Big Data project.
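The mapping described above can be pictured as a grid with one cell per pairing of activity and characteristic. The following is a minimal sketch in Python, not the author's tooling: the names `ACTIVITIES`, `CHARACTERISTICS`, `build_question_grid`, and `add_question` are hypothetical, and the cells start empty because the abstract does not list the questions themselves.

```python
from itertools import product

# Activities and characteristics as named in the abstract.
ACTIVITIES = ["planning", "acquiring", "preparing",
              "analyzing", "preserving", "discovering"]
CHARACTERISTICS = ["volume", "variety", "velocity", "veracity"]


def build_question_grid():
    """Return an empty grid: one list of questions per (activity, characteristic)."""
    return {(a, c): [] for a, c in product(ACTIVITIES, CHARACTERISTICS)}


def add_question(grid, activity, characteristic, question):
    """Record a practitioner question in the matching cell (hypothetical helper)."""
    key = (activity, characteristic)
    if key not in grid:
        raise KeyError(f"unknown cell: {key}")
    grid[key].append(question)


grid = build_question_grid()
```

A project team could fill each cell with its own questions via `add_question` and then review the grid one activity at a time.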

URL : Revisiting the Data Lifecycle with Big Data Curation


Exploring the opportunities and challenges of implementing open research strategies within development institutions

This research proposal calls for support for a pilot project to conduct open data case studies with eight (8) IDRC grantees to develop and implement open data management and sharing plans.

The results of the case studies will serve to refine guidelines for the implementation of development research funders’ open research data policies. The case studies will examine the scale of legal, ethical and technical challenges that might limit the sharing of data from IDRC projects including issues of:

  • Privacy, personally identifiable information and protection of human subjects
  • Protection of intellectual property generated from projects or potential for financial risks for projects or institutions
  • Challenges in the local legal environment, including ownership of data
  • Ethical issues in releasing or sharing of indigenous and community knowledge, and the relationship between project participants and investigators particularly in the context of historical expropriation of resources
  • Local and global issues of capacity and expertise in the management and sharing of data

The duration of the current project will be sixteen (16) months, commencing September 2015 and ending in December 2016. The project will focus on auditing the data being produced by the participating projects, supporting the development of data management and sharing plans, and surfacing and cataloguing issues that arise.

URL : Exploring the opportunities and challenges of implementing open research strategies within development institutions



Assessment of and Response to Data Needs of Clinical and Translational Science Researchers and Beyond

Objective and Setting

As universities and libraries grapple with data management and “big data,” the need for data management solutions across disciplines is particularly relevant in clinical and translational science (CTS) research, which is designed to traverse disciplinary and institutional boundaries.

At the University of Florida Health Science Center Library, a team of librarians undertook an assessment of the research data management needs of CTS researchers, including an online assessment and follow-up one-on-one interviews.

Design and Methods

The 20-question online assessment was distributed to all investigators affiliated with UF’s Clinical and Translational Science Institute (CTSI), and 59 investigators responded. Follow-up in-depth interviews were conducted with nine faculty and staff members.


Results

Results indicate that UF’s CTS researchers have diverse data management needs that are often specific to their discipline or current research project and span the data lifecycle. A common theme in responses was the need for consistent data management training, particularly for graduate students; this led to localized training within the Health Science Center and CTSI, as well as campus-wide training.

Another campus-wide outcome was the creation of an action-oriented Data Management/Curation Task Force, led by the libraries and with participation from Research Computing and the Office of Research.


Conclusions

Initiating conversations with affected stakeholders and campus leadership about best practices in data management and implications for institutional policy shows the library’s proactive leadership and furthers our goal to provide concrete guidance to our users in this area.

URL : Assessment of and Response to Data Needs of Clinical and Translational Science Researchers and Beyond


Data Management Plan Requirements for Campus Grant Competitions: Opportunities for Research Data Services Assessment and Outreach


Objective

To examine the effects of research data services (RDS) on the quality of data management plans (DMPs) required for a campus-level faculty grant competition, as well as to explore opportunities that the local DMP requirement presented for RDS outreach.


Methods

Nine reviewers each scored a randomly assigned portion of DMPs from 82 competition proposals. Each DMP was scored by three reviewers, and the three scores were averaged together to obtain the final score. Interrater reliability was measured using intraclass correlation.

Unpaired t-tests were used to compare mean DMP scores for faculty who utilized RDS services with those who did not. Unpaired t-tests were also used to compare mean DMP scores for proposals that were funded with proposals that were not funded. One-way ANOVA was used to compare mean DMP scores among proposals from six broad disciplinary categories.
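The scoring and comparison steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the t statistic assumes the pooled-variance (Student) form since the abstract does not say which variant was used, and only the test statistics and degrees of freedom are computed (a p-value would additionally need the t or F distribution).

```python
from math import sqrt
from statistics import mean, variance


def final_score(reviewer_scores):
    """Average the reviewer scores given to one DMP."""
    return mean(reviewer_scores)


def unpaired_t(a, b):
    """Student's unpaired t statistic (pooled variance, an assumption).

    Returns (t, degrees_of_freedom).
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA.

    Returns (F, df_between, df_within).
    """
    values = [x for g in groups for x in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    f = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f, k - 1, n - k
```

In the study's terms, `unpaired_t` would take the final scores of faculty who used RDS and of those who did not, and `one_way_anova_f` would take one list of final scores per disciplinary category.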


Results

Analyses showed that RDS consultations had a statistically significant effect on DMP scores. Differences between DMP scores for funded versus unfunded proposals and among disciplinary categories were not significant. The DMP requirement also provided a number of both expected and unexpected outreach opportunities for RDS services.


Conclusions

Requiring DMPs for campus grant competitions can provide important assessment and outreach opportunities for research data services.

While these results might not be generalizable to DMP review processes at federal funding agencies, they do suggest the importance, at any level, of developing a shared understanding of what constitutes a high quality DMP among grant applicants, grant reviewers, and RDS providers.

URL : Data Management Plan Requirements for Campus Grant Competitions


Les enjeux de la patrimonialisation et de la réutilisation des données qualitatives de la recherche en Sciences humaines et sociales (The challenges of heritage preservation and reuse of qualitative research data in the humanities and social sciences)

Research archives are inherently fascinating, since they show how discoveries are made and how science evolves from day to day. The arrival of digital technology has opened up new possibilities, notably for disseminating these data, but also new challenges, in terms of archiving among others.

Archiving, sharing, and reusing qualitative data in the humanities and social sciences (SHS) raises many questions, and the different stakeholders involved, scientific and technical information (IST) professionals and researchers, may hold divergent views. The aim of this thesis is to understand each group's point of view and to determine to what extent these views can be compatible.

URL : Les enjeux de la patrimonialisation et de la réutilisation des données qualitatives de la recherche en Sciences humaines et sociales


DataCite au service des données scientifiques : Identifier pour valoriser (DataCite in the service of scientific data: identifying in order to add value)

Research data, in the form of highly diverse digital objects, are finding their place in scientific and technical information (IST) services, mainly, but not only, as supplements to the publications that rely on these data.

The integration of different types of digital resources is progressing, and must be accompanied by interoperability standards, common metadata formats, and the ability to link these contents to one another and cite them in a standardized way.

The international DataCite consortium, in which Inist-CNRS represents France, has set itself the goal of supporting and accelerating this development. In particular, it operates as a DOI (Digital Object Identifier) registration agency, regarding DOIs, already well established in the publishing world, as an effective tool for identifying data persistently, and thereby making them easier to discover, access, and cite.

DataCite has developed its own metadata schema and has put in place specific features that promote data sharing and reuse. Adding value in this way is, in particular, part of an approach that seeks to benefit fully from the potential of open data.

It is also an essential contribution to better recognition of the scientific work of producing, managing, and making data available, and notably to having that work taken into account in evaluation criteria.

It is encouraging, moreover, to see that these criteria are opening up to alternative metrics, including those concerning data. The specific topic of data citation has recently been the subject of several international initiatives aimed at harmonizing practices and issuing recommendations.

Through the Data Citation Synthesis Group, these initiatives have converged on a few principles that are becoming widely recognized and accepted. In this context, publishers must adapt and clearly define their policies on links between data and publications. There is also a strong trend toward agreements between publishers and data repositories.

DataCite's actions and services fit in with other international structures and initiatives set up around research data and persistent identifiers: Research Data Alliance, WDS-ICSU, CODATA, EPIC, Data Citation Index, etc.

A particular example is the European project ODIN, in which DataCite and the ORCID initiative for the creation of author identifiers are attempting to connect the different types of identifiers.


The FAIR Guiding Principles for scientific data management and stewardship

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles.

The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.

This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

URL : The FAIR Guiding Principles for scientific data management and stewardship
