ORCID growth and field-wise dynamics of adoption: A case study of the Toulouse scientific area

Authors : Marie-Dominique Heusse, Guillaume Cabanac

Research-focused information systems harvest and promote the scientific output of researchers. Disambiguating author identities is key when disentangling homonyms to avoid merging several persons’ records.

ORCID offers an identifier to link one’s identity, affiliations and bibliography. While funding agencies and scholarly publishers promote ORCID, little is known about its adoption rate. We introduce a method to quantify ORCID adoption according to researchers’ discipline and occupation in a higher-education organization.

We semi-automatically matched the 6,607 staff members affiliated with the 145 labs of the Toulouse scientific area against the 7.3 million profiles at orcid.org. The observed ORCID adoption of 41.8% comes with discipline-wise disparities. Unexpectedly, only 48.3% of all profiles listed at least one work; profiles with no works might just have been created to get an identifier.
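
As an illustration of the matching step (not of the authors' exact pipeline), the sketch below retrieves candidate ORCID iDs for a staff member's name through the public ORCID v3.0 search API. The affiliation and publication checks that make the process 'semi-automatic' are left out, and the JSON field names should be verified against the current API documentation.

```python
import requests

ORCID_SEARCH = "https://pub.orcid.org/v3.0/search/"  # public, read-only search endpoint

def candidate_orcids(given_names: str, family_name: str, rows: int = 20) -> list[str]:
    """Return ORCID iDs whose registered name matches a staff member's name.

    This only lists candidates; deciding which candidate (if any) really is the
    staff member, e.g. from affiliations and listed works, is the manual part
    of a semi-automatic matching workflow.
    """
    query = f'given-names:"{given_names}" AND family-name:"{family_name}"'
    response = requests.get(
        ORCID_SEARCH,
        params={"q": query, "rows": rows},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    hits = response.json().get("result") or []
    return [hit["orcid-identifier"]["path"] for hit in hits]

# Hypothetical usage for one staff member:
# print(candidate_orcids("Marie", "Dupont"))
```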

Those ‘empty’ profiles are of little interest for the entity disambiguation task. To our knowledge, this is the first study of ORCID adoption at the scale of a multidisciplinary scientific metropole. This method is replicable and future studies can target other cases to contrast the dynamics of ORCID adoption worldwide.

DOI : https://doi.org/10.1002/leap.1451

Prevalence of nonsensical algorithmically generated papers in the scientific literature

Authors : Guillaume Cabanac, Cyril Labbé

In 2014, leading publishers withdrew more than 120 nonsensical publications automatically generated with the SCIgen program. Casual observations suggested that similar problematic papers are still being published and sold, without follow-up retractions.

No systematic screening has been performed, and the prevalence of such nonsensical publications in the scientific literature is unknown. Our contribution is twofold.

First, we designed a detector that combs the scientific literature for grammar-based computer-generated papers. Applied to SCIgen, it achieves 83.6% precision. Second, we performed a scientometric study of the 243 detected SCIgen-papers from 19 publishers.
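
For illustration only, the toy detector below flags a text when several stock phrases of the kind SCIgen's context-free grammar produces co-occur in it. The marker phrases and the threshold are made up for this sketch; they are not the grammar-derived fingerprints used by the detector described above.

```python
# Toy stand-in for a grammar-based detector (not the detector evaluated in the paper):
# flag a full text when several SCIgen-like stock phrases co-occur in it.
SCIGEN_MARKERS = [  # illustrative phrases only, not the actual fingerprints
    "the rest of this paper is organized as follows",
    "we motivate the need for",
    "our evaluation strives to make these points clear",
]

def looks_computer_generated(full_text: str, threshold: int = 2) -> bool:
    """Return True when at least `threshold` marker phrases occur in the text."""
    text = full_text.lower()
    return sum(marker in text for marker in SCIGEN_MARKERS) >= threshold
```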

We estimate the prevalence of SCIgen-papers to be 75 per million papers in Information and Computing Sciences. Only 19% of the 243 problematic papers were dealt with: formal retraction (12) or silent removal (34).

Publishers still serve and sometimes sell the remaining 197 papers without any caveat. We found evidence of citation manipulation via edited SCIgen bibliographies. This work reveals metric gaming up to the point of absurdity: fraudsters publish nonsensical algorithmically generated papers featuring genuine references.

It stresses the need to screen papers for nonsense before peer review and to chase citation manipulation in published papers. Overall, this is yet another illustration of the harmful effects of the pressure to publish or perish.

DOI : https://doi.org/10.1002/asi.24495

Day-to-day discovery of preprint–publication links

Authors : Guillaume Cabanac, Theodora Oikonomidi, Isabelle Boutron

Preprints promote the open and fast communication of non-peer-reviewed work. Once a preprint is published in a peer-reviewed venue, the preprint server updates its web page: a prominent hyperlink leading to the newly published work is added.

Linking preprints to publications is of utmost importance as it provides readers with the latest version of a now certified work. Yet leading preprint servers fail to identify all existing preprint–publication links.

This limitation calls for a more thorough approach to this critical information retrieval task: overlooking published evidence translates into partial and even inaccurate systematic reviews on health-related issues, for instance.

We designed an algorithm leveraging Crossref, a public and free source of bibliographic metadata, to comb the literature for preprint–publication links. We tested it on a reference preprint set identified and curated for a living systematic review on interventions for preventing and treating COVID-19, performed by an international collaboration: the COVID-NMA initiative (covid-nma.com).
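
A minimal sketch of the kind of query involved, using the free Crossref REST API (api.crossref.org), is shown below. The published linker adds further checks (authors, dates, similarity thresholds) that are omitted here, and the idea of cutting off on the relevance score is an assumption for the example.

```python
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"  # free, public REST API

def candidate_publications(preprint_title: str, first_author: str, rows: int = 5) -> list[dict]:
    """Query Crossref for journal articles that could be the published version of a preprint."""
    response = requests.get(
        CROSSREF_WORKS,
        params={
            "query.bibliographic": preprint_title,  # fuzzy title matching
            "query.author": first_author,
            "filter": "type:journal-article",
            "rows": rows,
        },
        timeout=30,
    )
    response.raise_for_status()
    items = response.json()["message"]["items"]
    return [
        {
            "doi": item["DOI"],
            "title": (item.get("title") or [""])[0],
            "score": item.get("score"),  # Crossref relevance score; any cut-off would be tuned on data
        }
        for item in items
    ]
```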

The reference set comprised 343 preprints, 121 of which appeared as a publication in a peer-reviewed journal. While the preprint servers identified 39.7% of the preprint–publication links, our linker identified 90.9% of the expected links with no clues taken from the preprint servers.

The accuracy of the proposed linker is 91.5% on this reference set, with 90.9% sensitivity and 91.9% specificity. This is a 16.26% increase in accuracy compared to that of preprint servers. We release this software as supplementary material to foster its integration into preprint servers’ workflows and enhance a daily preprint–publication chase that is useful to all readers, including systematic reviewers.
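
To see how the three figures fit together, here is a back-of-the-envelope reconstruction. The confusion-matrix counts below are inferred from the reported percentages and the 343/121 split; they are not taken from the paper.

```python
# Counts inferred from the reported percentages (not reported in the paper):
# 121 preprints had a peer-reviewed publication (positives), 222 did not (negatives).
tp, fn = 110, 11   # 110 / 121 ≈ 0.909 -> 90.9% sensitivity
tn, fp = 204, 18   # 204 / 222 ≈ 0.919 -> 91.9% specificity

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fn + tn + fp)   # 314 / 343 ≈ 0.915 -> 91.5%

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} accuracy={accuracy:.3f}")
```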

This preprint–publication linker currently provides day-to-day updates to the biomedical experts of the COVID-NMA initiative.

DOI : https://doi.org/10.1007/s11192-021-03900-7

Thirteen Ways to Write an Abstract

Authors : James Hartley, Guillaume Cabanac

The abstract is a crucial component of a research article. Abstracts head the text—and sometimes they can appear alone in separate listings (e.g., conference proceedings). The purpose of the abstract is to inform the reader succinctly what the paper is about, why and how the research was carried out, and what conclusions might be drawn.

In this paper we consider the same (or a similar) abstract in 13 different formats to illustrate the strengths and weaknesses of each approach.

DOI : https://doi.org/10.3390/publications5020011

Interroger le texte scientifique

Author : Guillaume Cabanac

Textual documents are familiar and unavoidable vectors of information in our information society. With the rise of digital platforms and social media, text now takes the form of web pages, blog posts, comments, tweets and tags, among others. Once passive consumers, readers are in turn becoming content producers.

The result is a web of interpersonal exchanges that weaves digital social networks extending well beyond our personal circles. In this context, the nature and format of texts, the intentions of their authors (to inform, rebroadcast, criticize, complement, correct, etc.), their spatio-temporal context, as well as the varying veracity and freshness of the information they convey, are all subtleties to be integrated into information retrieval models.

The first part of this thesis presents a synthesis of results in information retrieval aimed at modeling these factors to improve the relevance of searches over textual corpora, notably those stemming from social media.

The research program I am developing also aims to 'interrogate the text' to reveal information about its content, its authors and its readers. The scientific text was chosen as the target because of the richness of its content and metadata. The second part of the thesis thus synthesizes results in scientometrics, the term for the quantitative study of science and innovation.

The aim was to question scientific texts and their underlying networks (lexicon, references, authors, institutions, etc.) in order to bring out high-value knowledge and shed light on the creation and dissemination of scientific knowledge.

The two strands articulated in this thesis combine to define an interdisciplinary research program at the crossroads of computer science, scientometrics and the sociology of science.

Its ambition is to interrogate the scientific text in order to improve access to it (via information retrieval) while helping to elicit the mechanisms underlying the genesis and evolution of social worlds and knowledge in science (via scientometrics).

Alternative location : https://tel.archives-ouvertes.fr/tel-01413878/

Bibliogifts in LibGen? A study of a text-sharing platform driven by biblioleaks and crowdsourcing

Author : Guillaume Cabanac

Research articles disseminate the knowledge produced by the scientific community. Access to this literature is crucial for researchers and the general public. Apparently, “bibliogifts” are available online for free from text-sharing platforms.

However, little is known about such platforms. What is the size of the underlying digital libraries? What are the topics covered? Where do these documents originally come from? This article reports on a study of the Library Genesis platform (LibGen).

The 25 million documents (42 terabytes) it hosts and distributes for free are mostly research articles, textbooks, and books in English. The article collection stems from isolated, but massive, article uploads (71%) in line with a “biblioleaks” scenario, as well as from daily crowdsourcing (29%) by worldwide users of platforms such as Reddit Scholar and Sci-Hub.

By relating the DOIs registered at CrossRef and those cached at LibGen, this study reveals that 36% of all DOI articles are available for free at LibGen. This figure is even higher (68%) for three major publishers: Elsevier, Springer, and Wiley. More research is needed to understand to what extent researchers and the general public have recourse to such text-sharing platforms and why.
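
Once both DOI lists are in hand, the coverage figure boils down to a set intersection. The sketch below illustrates the computation; the study itself works from full metadata dumps, not from this toy function.

```python
def doi_coverage(libgen_dois: set[str], crossref_dois: set[str]) -> float:
    """Share of Crossref-registered DOIs also cached at LibGen (e.g. 0.36 for 36%)."""
    # DOIs are case-insensitive, so normalize before intersecting.
    libgen = {doi.lower() for doi in libgen_dois}
    crossref = {doi.lower() for doi in crossref_dois}
    return len(libgen & crossref) / len(crossref)

# Hypothetical usage:
# coverage = doi_coverage(libgen_dois, crossref_dois)
# print(f"{coverage:.0%} of DOI-registered articles are available at LibGen")
```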

URL : http://www.irit.fr/publis/SIG/2015_JASIST_C.pdf