Why are these publications missing? Uncovering the reasons behind the exclusion of documents in free-access scholarly databases

Authors : Lorena Delgado-Quirós, Isidro F. Aguillo, Alberto Martín-Martín, Emilio Delgado López-Cózar, Enrique Orduña-Malea, José Luis Ortega

This study analyses the coverage of seven free-access bibliographic databases (Crossref, Dimensions in its non-subscription version, Google Scholar, Lens, Microsoft Academic, Scilit, and Semantic Scholar) to identify the reasons that might cause the exclusion of scholarly documents and how these exclusions influence coverage.

To do this, 116,000 randomly selected bibliographic records from Crossref were used as a baseline. API endpoints and web scraping were used to query each database. The results show that coverage differences are mainly caused by the way each service builds its database.

While classic bibliographic databases ingest almost exactly the same content as Crossref (Lens and Scilit miss only 0.1% and 0.2% of the records, respectively), academic search engines show lower coverage (Google Scholar misses 9.8% of the records, Semantic Scholar 10%, and Microsoft Academic 12%). For the search engines, coverage differences are mainly attributed to external factors, such as web accessibility and robot exclusion policies (39.2%–46% of missing records), and to internal requirements that exclude secondary content (6.5%–11.6%).

In the case of Dimensions, the classic bibliographic database with the lowest coverage (7.6% of records missing), internal selection criteria, such as the indexation of full books instead of book chapters (65%) and the exclusion of secondary content (15%), are the main reasons for missing publications.

DOI : https://doi.org/10.1002/asi.24839

Large coverage fluctuations in Google Scholar: a case study

Authors : Alberto Martín-Martín, Emilio Delgado López-Cózar

Unlike other academic bibliographic databases, Google Scholar intentionally operates in a way that does not maintain coverage stability: documents that stop being available to Google Scholar’s crawlers are removed from the system.

This can also affect Google Scholar’s citation graph (citation counts can decrease). Furthermore, because Google Scholar is not transparent about its coverage, the only way to directly observe coverage loss is through regular monitoring of Google Scholar data.

Because of this, few studies have empirically documented this phenomenon. This study analyses a large decrease in coverage of documents in the field of Astronomy and Astrophysics that took place in 2019 and its subsequent recovery, using longitudinal data from previous analyses and a new dataset extracted in 2020.

Documents from most of the larger publishers in the field disappeared from Google Scholar despite continuing to be available on the Web, which suggests an error on Google Scholar’s side. The documents that disappeared did not reappear until the following index-wide update, many months after the problem was discovered.

The slowness with which Google Scholar is currently able to resolve indexing errors is a clear limitation of the platform both for literature search and bibliometric use cases.

URL : https://arxiv.org/abs/2102.07571

Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations

Authors : Alberto Martín-Martín, Mike Thelwall, Enrique Orduna-Malea, Emilio Delgado López-Cózar

New sources of citation data have recently become available, such as Microsoft Academic, Dimensions, and the OpenCitations Index of CrossRef open DOI-to-DOI citations (COCI). Although these have been compared to the Web of Science Core Collection (WoS), Scopus, or Google Scholar, there is no systematic evidence of their differences across subject categories.

In response, this paper investigates 3,073,351 citations found by these six data sources to 2,515 English-language highly-cited documents published in 2006 from 252 subject categories, expanding and updating the largest previous study. Google Scholar found 88% of all citations, many of which were not found by the other sources, and nearly all citations found by the remaining sources (89–94%).

A similar pattern held within most subject categories. Microsoft Academic is the second largest overall (60% of all citations), including 82% of Scopus citations and 86% of WoS citations. In most categories, Microsoft Academic found more citations than Scopus and WoS (182 and 223 subject categories, respectively), but had coverage gaps in some areas, such as Physics and some Humanities categories. After Scopus, Dimensions is fourth largest (54% of all citations), including 84% of Scopus citations and 88% of WoS citations.

It found more citations than Scopus in 36 categories, more than WoS in 185, and displays some coverage gaps, especially in the Humanities. Following WoS, COCI is the smallest, with 28% of all citations. Google Scholar is still the most comprehensive source. In many subject categories Microsoft Academic and Dimensions are good alternatives to Scopus and WoS in terms of coverage.

DOI : https://doi.org/10.1007/s11192-020-03690-4

Unbundling Open Access dimensions: a conceptual discussion to reduce terminology inconsistencies

Authors : Alberto Martín-Martín, Rodrigo Costas, Thed N. van Leeuwen, Emilio Delgado López-Cózar

The current ways in which documents are made freely accessible on the Web no longer adhere to the models established by the Budapest/Bethesda/Berlin (BBB) definitions of Open Access (OA). Since those definitions were established, OA-related terminology has expanded in an attempt to keep up with all the variants of OA publishing that now exist.

However, the inconsistent and arbitrary terminology used to refer to these variants is complicating communication about OA-related issues. This study intends to initiate a discussion on this issue by proposing a conceptual model of OA.

Our model features six different dimensions (authoritativeness, user rights, stability, immediacy, peer-review, and cost). Each dimension allows for a range of different options. We believe that by combining the options in these six dimensions, we can arrive at all the current variants of OA, while avoiding ambiguous and/or arbitrary terminology.

This model can be a useful tool for funders and policy makers who need to decide exactly which aspects of OA are necessary for each specific scenario.

Alternative location : https://arxiv.org/abs/1806.05029

Google Scholar as a data source for research assessment

Authors : Emilio Delgado López-Cózar, Enrique Orduna-Malea, Alberto Martín-Martín

The launch of Google Scholar (GS) marked the beginning of a revolution in the scientific information market. This search engine, unlike traditional databases, automatically indexes information from the academic web. Its ease of use, together with its wide coverage and fast indexing speed, have made it the first tool most scientists currently turn to when they need to carry out a literature search.

Additionally, the fact that its search results were accompanied from the beginning by citation counts, as well as the later development of secondary products which leverage this citation data (such as Google Scholar Metrics and Google Scholar Citations), made many scientists wonder about its potential as a source of data for bibliometric analyses.

The goal of this chapter is to lay the foundations for the use of GS as a supplementary source (and in some disciplines, arguably the best alternative) for scientific evaluation.

First, we present a general overview of how GS works. Second, we present empirical evidence about its main characteristics (size, coverage, and growth rate). Third, we carry out a systematic analysis of the main limitations this search engine presents as a tool for the evaluation of scientific performance.

Lastly, we discuss the main differences between GS and other more traditional bibliographic databases in light of the correlations found between their citation data. We conclude that Google Scholar presents a broader view of the academic world because it has brought to light a great amount of sources that were not previously visible.

URL : https://arxiv.org/abs/1806.04435

The counting house: measuring those who count. Presence of Bibliometrics, Scientometrics, Informetrics, Webometrics and Altmetrics in the Google Scholar Citations, ResearcherID, ResearchGate, Mendeley & Twitter

Authors : Alberto Martín-Martín, Enrique Orduna-Malea, Juan M. Ayllón, Emilio Delgado López-Cózar

Following in the footsteps of the model of scientific communication, which has recently gone through a metamorphosis (from the Gutenberg galaxy to the Web galaxy), a change in the model and methods of scientific evaluation is also taking place.

A set of new scientific tools now provides a variety of indicators that measure all actions and interactions among scientists in the digital space, making new aspects of scientific communication emerge.

In this work we present a method for capturing the structure of an entire scientific community (the Bibliometrics, Scientometrics, Informetrics, Webometrics, and Altmetrics community) and the main agents that are part of it (scientists, documents, and sources) through the lens of Google Scholar Citations.

Additionally, we compare these author portraits to the ones offered by other profile or social platforms currently used by academics (ResearcherID, ResearchGate, Mendeley, and Twitter), in order to test their degree of use, completeness, reliability, and the validity of the information they provide.

A sample of 814 authors (researchers in Bibliometrics with a public profile in Google Scholar Citations) was subsequently searched in the other platforms, collecting the main indicators computed by each of them.

The data collection was carried out in September 2015. The Spearman correlation was applied to these indicators (31 in total), and a Principal Component Analysis was carried out to reveal the relationships among metrics and platforms, as well as the possible existence of metric clusters.

URL : https://arxiv.org/abs/1602.02412