Transparency, provenance and collections as data: the National Library of Scotland’s Data Foundry

Author : Sarah Ames

‘Collections as data’ has become a core activity for libraries in recent years: it is important that we make collections available in machine-readable formats to enable and encourage computational research. However, while this is a necessary output, discussion around the processes and workflows required to turn collections into data, and to make collections data available openly, is just as valuable.

With libraries increasingly becoming producers of their own collections – presenting data from digitisation and digital production tools as part of datasets, for example – and making collections available at scale through mass-digitisation programmes, the trustworthiness of our processes comes into question.

In a world of big data, often of unclear origins, how can libraries be transparent about the ways in which collections are turned into data, how do we ensure that biases in our collections are recognised and not amplified, and how do we make these datasets available openly for reuse?

This paper presents a case study of work underway at the National Library of Scotland to present collections as data in an open and transparent way – from establishing a new Digital Scholarship Service, to workflows and online presentation of datasets.

It considers the changes to existing processes needed to produce the Data Foundry, the National Library of Scotland’s open data delivery platform, and explores the practical challenges of presenting collections as data online in an open, transparent and coherent manner.

URL : Transparency, provenance and collections as data: the National Library of Scotland’s Data Foundry

Original location : https://www.liberquarterly.eu/article/10.18352/lq.10371/

Modes d’évaluation ouverte par les pairs : de la revue à la plateforme

Authors : Evelyne Broudoux, Madjid Ihadjadene

This article aims to provide a state of the art of the different forms of peer review of articles and conference papers. From "blind" review to "open" review, many options exist and are being tried out.

It is in the sciences that the most sociotechnical innovations are found, built on publication platforms that model original editorial workflows.

Review can be opened up among peers by making the reviewers' identities and/or reports public, at different stages of a scientific article: as a preprint, while it is being written, or after publication.

This state of the art is based on a set of publications produced essentially by those involved in open peer review, who come mainly from the STM disciplines.

URL : Modes d’évaluation ouverte par les pairs : de la revue à la plateforme

Original location : https://revue-cossi.numerev.com/articles/revue-9/2496-modes-d-evaluation-ouverte-par-les-pairs-de-la-revue-a-la-plateforme

Prevalence of nonsensical algorithmically generated papers in the scientific literature

Authors : Guillaume Cabanac, Cyril Labbé

In 2014 leading publishers withdrew more than 120 nonsensical publications automatically generated with the SCIgen program. Casual observations suggested that similar problematic papers are still published and sold, without follow-up retractions.

No systematic screening has been performed and the prevalence of such nonsensical publications in the scientific literature is unknown. Our contribution is 2-fold.

First, we designed a detector that combs the scientific literature for grammar-based computer-generated papers. Applied to SCIgen, it has an 83.6% precision. Second, we performed a scientometric study of the 243 detected SCIgen-papers from 19 publishers.

We estimate the prevalence of SCIgen-papers to be 75 per million papers in Information and Computing Sciences. Only 19% of the 243 problematic papers were dealt with: formal retraction (12) or silent removal (34).

Publishers still serve and sometimes sell the remaining 197 papers without any caveat. We found evidence of citation manipulation via edited SCIgen bibliographies. This work reveals metric gaming up to the point of absurdity: fraudsters publish nonsensical algorithmically generated papers featuring genuine references.
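
The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts given above (243 detected papers, 12 retractions, 34 silent removals); the variable names are illustrative and not taken from the paper.

```python
# Counts quoted in the abstract; variable names are illustrative only.
detected = 243          # SCIgen papers found by the detector
retracted = 12          # formally retracted
silently_removed = 34   # removed without a retraction notice

dealt_with = retracted + silently_removed
share_dealt_with = round(100 * dealt_with / detected)  # percentage, rounded
still_served = detected - dealt_with

print(f"{dealt_with} of {detected} dealt with ({share_dealt_with}%)")
print(f"{still_served} still served without caveat")
```

Both derived values agree with the abstract: 46 papers dealt with (about 19%) and 197 remaining.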

It stresses the need to screen papers for nonsense before peer-review and chase citation manipulation in published papers. Overall, this is yet another illustration of the harmful effects of the pressure to publish or perish.

URL : Prevalence of nonsensical algorithmically generated papers in the scientific literature

DOI : https://doi.org/10.1002/asi.24495

Digital Object Identifier (DOI) Under the Context of Research Data Librarianship

Author : Jia Liu

A digital object identifier (DOI) is an increasingly prominent persistent identifier in finding and accessing scholarly information. This paper intends to present an overview of global development and approaches in the field of DOI and DOI services with a slight geographical focus on Germany.

First, the initiation and components of the DOI system and the structure of a DOI name are explored. Next, the fundamental and specific characteristics of DOIs are described, and DOIs for three kinds of typical intellectual entities in scholarly communication are dealt with. Then, a general DOI service pyramid is sketched, with brief descriptions of the functions of institutions at different levels.

After that, approaches of the research data librarianship community in the field of RDM, especially DOI services, are elaborated. As examples, the DOI services provided in German research libraries as well as best practices of DOI services in a German library are introduced; and finally, the current practices and some issues dealing with DOIs are summarized. It is foreseeable that DOI, which is crucial to FAIR research data, will gain extensive recognition in the scientific world.
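
As a brief illustration of the DOI name structure mentioned in this abstract: a DOI name consists of a prefix beginning with "10." and a suffix, separated by a slash. The sketch below splits a DOI into those two parts. The function name `split_doi` and the list of resolver forms it strips are assumptions made for this example, not something described in the paper.

```python
def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI name into (prefix, suffix) at the first slash.

    Hypothetical helper for illustration; it performs only a light
    sanity check, not full validation.
    """
    doi = doi.strip()
    # Drop a leading resolver URL or "doi:" label if one is present
    # (this list of forms is an assumption for the example).
    for lead in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(lead):
            doi = doi[len(lead):]
            break
    prefix, slash, suffix = doi.partition("/")
    if not slash or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI name: {doi!r}")
    return prefix, suffix


# Example using the DOI of this article.
print(split_doi("https://doi.org/10.7191/jeslib.2021.1180"))
```

Splitting at the first slash only is deliberate, since a suffix may itself contain further slashes.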

URL : Digital Object Identifier (DOI) Under the Context of Research Data Librarianship

DOI : https://doi.org/10.7191/jeslib.2021.1180

Open access book usage data – how close is COUNTER to the other kind?

Author : Ronald Snijder

In April 2020, the OAPEN Library moved to a new platform, based on DSpace 6. During the same period, IRUS-UK started working on the deployment of Release 5 of the COUNTER Code of Practice (R5). This is, therefore, a good moment to compare two widely used usage metrics – R5 and Google Analytics (GA).

This article discusses the download data of close to 11,000 books and chapters from the OAPEN Library, from the period 15 April 2020 to 31 July 2020. When a book or chapter is downloaded, it is logged by GA and at the same time a signal is sent to IRUS-UK.

This results in two datasets: the monthly downloads measured in GA and the usage reported by R5, also clustered by month. The number of downloads reported by GA is considerably larger than that reported by R5. The total number of downloads in GA for the period is over 3.6 million.

In contrast, the amount reported by R5 is 1.5 million, around 400,000 downloads per month. Contrasting R5 and GA data on a country-by-country basis shows significant differences. GA lists more than five times the number of downloads for several countries, although the totals for other countries are about the same.

When looking at individual titles, of the 500 highest ranked titles in GA that are also part of the 1,000 highest ranked titles in R5, only 6% of the titles are relatively close together. The choice of metric service has considerable consequences on what is reported.

Thus, drawing conclusions about the results should be done with care. One metric is not better than the other, but we should be open about the choices made. After all, open access book metrics are complicated, and we can only benefit from clarity.

URL : Open access book usage data – how close is COUNTER to the other kind?

DOI : http://doi.org/10.1629/uksg.539

Day-to-day discovery of preprint–publication links

Authors : Guillaume Cabanac, Theodora Oikonomidi, Isabelle Boutron

Preprints promote the open and fast communication of non-peer-reviewed work. Once a preprint is published in a peer-reviewed venue, the preprint server updates its web page: a prominent hyperlink leading to the newly published work is added.

Linking preprints to publications is of utmost importance as it provides readers with the latest version of a now certified work. Yet leading preprint servers fail to identify all existing preprint–publication links.

This limitation calls for a more thorough approach to this critical information retrieval task: overlooking published evidence translates into partial and even inaccurate systematic reviews on health-related issues, for instance.

We designed an algorithm leveraging the Crossref public and free source of bibliographic metadata to comb the literature for preprint–publication links. We tested it on a reference preprint set identified and curated for a living systematic review on interventions for preventing and treating COVID-19 performed by an international collaboration: the COVID-NMA initiative (covid-nma.com).

The reference set comprised 343 preprints, 121 of which appeared as a publication in a peer-reviewed journal. While the preprint servers identified 39.7% of the preprint–publication links, our linker identified 90.9% of the expected links with no clues taken from the preprint servers.

The accuracy of the proposed linker is 91.5% on this reference set, with 90.9% sensitivity and 91.9% specificity. This is a 16.26% increase in accuracy compared to that of preprint servers. We release this software as supplementary material to foster its integration into preprint servers’ workflows and enhance a daily preprint–publication chase that is useful to all readers, including systematic reviewers.
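
The reported metrics can be reproduced from a confusion matrix. The sketch below is a reconstruction under stated assumptions, not the authors' published figures: the abstract gives only the set sizes (343 preprints, 121 published) and the percentages, so the true-positive and true-negative counts are inferred as the values that round to those percentages. The baseline further assumes that preprint servers reported no false links, and the 16.26% figure is read here as a relative increase computed from the rounded accuracies.

```python
# Set sizes stated in the abstract.
TOTAL = 343
PUBLISHED = 121
UNPUBLISHED = TOTAL - PUBLISHED  # 222 preprints with no publication

# Inferred counts (assumptions): chosen to match the quoted percentages.
linker_tp = 110   # links found by the linker   (110/121 is about 90.9%)
linker_tn = 204   # correct "no link" answers   (204/222 is about 91.9%)
server_tp = 48    # links shown by the servers  (48/121 is about 39.7%)
server_tn = UNPUBLISHED  # assumption: servers made no false links


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


sensitivity = pct(linker_tp, PUBLISHED)
specificity = pct(linker_tn, UNPUBLISHED)
accuracy = pct(linker_tp + linker_tn, TOTAL)
server_accuracy = pct(server_tp + server_tn, TOTAL)

# Relative increase in accuracy, using the rounded percentages.
relative_gain = round(100 * (accuracy - server_accuracy) / server_accuracy, 2)

print(sensitivity, specificity, accuracy, server_accuracy, relative_gain)
```

Under these assumptions the sketch gives 90.9% sensitivity, 91.9% specificity and 91.5% accuracy for the linker, 78.7% accuracy for the preprint servers, and a relative gain of 16.26%, which is consistent with the abstract.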

This preprint–publication linker currently provides day-to-day updates to the biomedical experts of the COVID-NMA initiative.

URL : Day-to-day discovery of preprint–publication links

DOI : https://doi.org/10.1007/s11192-021-03900-7