Text data mining and data quality management for research information systems in the context of open data and open science

Authors : Otmane Azeroual, Gunter Saake, Mohammad Abuosba, Joachim Schöpfel

In the implementation and use of research information systems (RIS) in scientific institutions, text data mining and semantic technologies are key technologies for the meaningful use of large amounts of data.

It is not the collection of data that is difficult, but the further processing and integration of the data in RIS. Data is usually not uniformly formatted and structured; texts and tables, for example, often cannot be linked.

These include various source systems with their different data formats, such as project and publication databases, the CERIF and RCD data models, etc. Internal and external data sources continue to develop.

On the one hand, they must be constantly synchronized and the results of the data links checked. On the other hand, the texts must be processed in natural language and certain information extracted.

Text data mining is used to analyze the quality of the metadata and to identify entities and general keywords, so that the user is supported in the search for interesting research information.

The information age makes it easier to store huge amounts of data, and the number of documents on the internet, in institutions’ intranets, in newswires and blogs is overwhelming.

Search engines should help to specifically open up these sources of information and make them usable for administrative and research purposes. Against this backdrop, the aim of this paper is to provide an overview of text data mining techniques and the management of successful data quality for RIS in the context of open data and open science in scientific institutions and libraries, as well as to provide ideas for their application. In particular, solutions for the RIS will be presented.

URL : https://arxiv.org/abs/1812.04298

Creating Structured Linked Data to Generate Scholarly Profiles: A Pilot Project using Wikidata and Scholia

Authors : Mairelys Lemus-Rojas, Jere D. Odell


Wikidata, a knowledge base for structured linked data, provides an open platform for curating scholarly communication data. Because all elements in a Wikidata entry are linked to defining elements and metadata, other web systems can harvest and display the data in meaningful ways.

Thus, Wikidata has the capacity to serve as the data source for faculty profiles. Scholia is an example of how third-party tools can leverage the power of Wikidata to provide faculty profiles and bibliographic, data-driven visualizations.


In this article, we share our methods for contributing to Wikidata and displaying the data with Scholia.

We deployed these methods as part of a pilot project in which we contributed data about a small but unique school on the Indiana University-Purdue University Indianapolis (IUPUI) campus, the IU Lilly Family School of Philanthropy.


Following the completion of our pilot project, we aim to find additional methods for contributing large data collections to Wikidata. Specifically, we seek to contribute scholarly communication data that the library already maintains in other systems.

We are also facilitating Wikidata edit-a-thons to increase the library’s familiarity with the knowledge base and our capacity to contribute to the site.

DOI : https://doi.org/10.7710/2162-3309.2272

Evaluation of a novel cloud-based software platform for structured experiment design and linked data analytics

Authors : Hannes Juergens, Matthijs Niemeijer, Laura D. Jennings-Antipov, Robert Mans, Jack More, Antonius J. A. van Maris, Jack T. Pronk, Timothy S. Gardner

Open data in science requires precise definition of experimental procedures used in data generation, but traditional practices for sharing protocols and data cannot provide the required data contextualization.

Here, we explore implementation, in an academic research setting, of a novel cloud-based software system designed to address this challenge. The software supports systematic definition of experimental procedures as visual processes, acquisition and analysis of primary data, and linking of data and procedures in machine-computable form.

The software was tested on a set of quantitative microbial-physiology experiments. Though time-intensive, definition of experimental procedures in the software enabled much more precise, unambiguous definitions of experiments than conventional protocols.

Once defined, processes were easily reusable and composable into more complex experimental flows. Automatic coupling of process definitions to experimental data enables immediate identification of correlations between procedural details, intended and unintended experimental perturbations, and experimental outcomes.

Software-based experiment descriptions could ultimately replace terse and ambiguous ‘Materials and Methods’ sections in scientific journals, thus promoting reproducibility and reusability of published studies.

DOI : https://doi.org/10.1038/sdata.2018.195

L’ouverture des données publiques : un bien commun en devenir ?

Authors : Valérie Larroche, Marie-France Peyrelong, Philippe Beaune

This article examines open data as a commons. The preliminary processing carried out on the data to be made available creates a shared resource and, at first glance, has the potential to be a commons. The article identifies several stumbling blocks that qualify this claim.

The first argument stems from the licenses, which do not require providers of real-time data to guarantee continuity of service.

The second argument points to the role of the data re-user, who does not take part in the governance of the data.

Finally, the last argument highlights the fact that the local authorities involved in urban commons do not present open data as such.

Our arguments are based on analyses of city portals and on interviews with re-users of open data.

URL : L’ouverture des données publiques : un bien commun en devenir ?

Alternative location : http://journals.openedition.org/ticetsociete/2466

Full Disclosure: Open Business Data and the Publisher’s Cookbook

Authors : Sebastian Nordhoff, Felix Kopecky

This short paper presents the three main outcomes of the OpenAire project “Full disclosure: replicable strategies for book publications supplemented with empirical data”: a fully specified business model; accountancy data; and a “cookbook” containing recipes for setting up a resilient community-based book publisher.

Making these items available for free reuse will allow other publishing projects to understand, adapt, and modify the community-based model of Language Science Press.

URL : https://hal.archives-ouvertes.fr/hal-01816822

Redistributing Data Worlds: Open Data, Data Infrastructures and Democracy

Author : Jonathan Gray

Open data, defined as a set of ideas and conventions that transform information into a reusable public resource, is promoted for various purposes: to improve the transparency of public institutions, to create projects that strengthen democracy, to stimulate economic growth.

The social and technical infrastructures that support open data recompose the “worlds of data”: new social collectives are formed, and new meaning-making practices appear. Transnational political initiatives are emerging. Far from being a simple “release” of data, open data does not come without translation, mediation, and new social practices.

But can this movement serve as a basis for a richer democratic deliberation, or is it destined to socially institutionalize various forms of bureaucratization and commodification?

URL : https://ssrn.com/abstract=3111720

L’horizon d’une culture de la donnée ouverte : de l’utopie aux pratiques de gouvernance des données

Author : Anne Lehmans

The development of open data in France is leading stakeholders to question the data management strategies and practices that should be put in place in the organizations concerned.

Announcing a policy of opening up data, with a stated commitment to transparency, participation and innovation, is likely to disrupt routines in the way the circulation of information is managed and controlled.

The principles and forms of data governance are being reconsidered, with the opening of data acting as a catalyst for introducing a principle of shared decision-making into the data life cycle.

A research project on data culture, based on a qualitative survey of data management practices, shows that, faced with the demands, risks and benefits perceived in the agenda of opening and disseminating open data, varied data governance strategies are taking hold, with effects on information management and knowledge management.

URL : http://revue-cossi.info/numeros/n-1-2018-big-data-thick-data/708-1-2018-revue-lehmans