Scholar Metrics Scraper (SMS): automated retrieval of citation and author data

Authors : Yutong Cao, Nicole A. Cheung, Dean Giustini, Jeffrey LeDue, Timothy H. Murphy

Academic departments, research clusters and evaluators analyze author and citation data to measure research impact and to support strategic planning. We created Scholar Metrics Scraper (SMS) to automate the retrieval of bibliometric data for a group of researchers.

The project contains Jupyter notebooks that take a list of researchers as an input and exports a CSV file of citation metrics from Google Scholar (GS) to visualize the group’s impact and collaboration. A series of graph outputs are also available. SMS is an open solution for automating the retrieval and visualization of citation data.

URL : Scholar Metrics Scraper (SMS): automated retrieval of citation and author data

DOI : https://doi.org/10.3389/frma.2024.1335454

Why are these publications missing? Uncovering the reasons behind the exclusion of documents in free-access scholarly databases

Authors : Lorena Delgado-QuirósIsidro F. AguilloAlberto Martín-MartínEmilio Delgado López-CózarEnrique Orduña-MaleaJosé Luis Ortega

This study analyses the coverage of seven free-access bibliographic databases (Crossref, Dimensions—non-subscription version, Google Scholar, Lens, Microsoft Academic, Scilit, and Semantic Scholar) to identify the potential reasons that might cause the exclusion of scholarly documents and how they could influence coverage.

To do this, 116 k randomly selected bibliographic records from Crossref were used as a baseline. API endpoints and web scraping were used to query each database. The results show that coverage differences are mainly caused by the way each service builds their databases.

While classic bibliographic databases ingest almost the exact same content from Crossref (Lens and Scilit miss 0.1% and 0.2% of the records, respectively), academic search engines present lower coverage (Google Scholar does not find: 9.8%, Semantic Scholar: 10%, and Microsoft Academic: 12%). Coverage differences are mainly attributed to external factors, such as web accessibility and robot exclusion policies (39.2%–46%), and internal requirements that exclude secondary content (6.5%–11.6%).

In the case of Dimensions, the only classic bibliographic database with the lowest coverage (7.6%), internal selection criteria such as the indexation of full books instead of book chapters (65%) and the exclusion of secondary content (15%) are the main motives of missing publications.

URL : Why are these publications missing? Uncovering the reasons behind the exclusion of documents in free-access scholarly databases

DOI : https://doi.org/10.1002/asi.24839

Large coverage fluctuations in Google Scholar: a case study

Authors : Alberto Martín-Martín, Emilio Delgado López-Cózar

Unlike other academic bibliographic databases, Google Scholar intentionally operates in a way that does not maintain coverage stability: documents that stop being available to Google Scholar’s crawlers are removed from the system.

This can also affect Google Scholar’s citation graph (citation counts can decrease). Furthermore, because Google Scholar is not transparent about its coverage, the only way to directly observe coverage loss is through regular monitorization of Google Scholar data.

Because of this, few studies have empirically documented this phenomenon. This study analyses a large decrease in coverage of documents in the field of Astronomy and Astrophysics that took place in 2019 and its subsequent recovery, using longitudinal data from previous analyses and a new dataset extracted in 2020.

Documents from most of the larger publishers in the field disappeared from Google Scholar despite continuing to be available on the Web, which suggests an error on Google Scholar’s side. Disappeared documents did not reappear until the following index-wide update, many months after the problem was discovered.

The slowness with which Google Scholar is currently able to resolve indexing errors is a clear limitation of the platform both for literature search and bibliometric use cases.

URL : https://arxiv.org/abs/2102.07571

Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations

Authors : Alberto Martín-Martín, Mike Thelwall, Enrique Orduna-Malea, Emilio Delgado López-Cózar

New sources of citation data have recently become available, such as Microsoft Academic, Dimensions, and the OpenCitations Index of CrossRef open DOI-to-DOI citations (COCI). Although these have been compared to the Web of Science Core Collection (WoS), Scopus, or Google Scholar, there is no systematic evidence of their differences across subject categories.

In response, this paper investigates 3,073,351 citations found by these six data sources to 2,515 English-language highly-cited documents published in 2006 from 252 subject categories, expanding and updating the largest previous study. Google Scholar found 88% of all citations, many of which were not found by the other sources, and nearly all citations found by the remaining sources (89–94%).

A similar pattern held within most subject categories. Microsoft Academic is the second largest overall (60% of all citations), including 82% of Scopus citations and 86% of WoS citations. In most categories, Microsoft Academic found more citations than Scopus and WoS (182 and 223 subject categories, respectively), but had coverage gaps in some areas, such as Physics and some Humanities categories. After Scopus, Dimensions is fourth largest (54% of all citations), including 84% of Scopus citations and 88% of WoS citations.

It found more citations than Scopus in 36 categories, more than WoS in 185, and displays some coverage gaps, especially in the Humanities. Following WoS, COCI is the smallest, with 28% of all citations. Google Scholar is still the most comprehensive source. In many subject categories Microsoft Academic and Dimensions are good alternatives to Scopus and WoS in terms of coverage.

DOI : https://doi.org/10.1007/s11192-020-03690-4

Can Google Scholar and Mendeley help to assess the scholarly impacts of dissertations?

Authors : Kayvan Kousha, Mike Thelwall

Dissertations can be the single most important scholarly outputs of junior researchers. Whilst sets of journal articles are often evaluated with the help of citation counts from the Web of Science or Scopus, these do not index dissertations and so their impact is hard to assess.

In response, this article introduces a new multistage method to extract Google Scholar citation counts for large collections of dissertations from repositories indexed by Google.

The method was used to extract Google Scholar citation counts for 77,884 American doctoral dissertations from 2013-2017 via ProQuest, with a precision of over 95%. Some ProQuest dissertations that were dual indexed with other repositories could not be retrieved with ProQuest-specific searches but could be found with Google Scholar searches of the other repositories.

The Google Scholar citation counts were then compared with Mendeley reader counts, a known source of scholarly-like impact data. A fifth of the dissertations had at least one citation recorded in Google Scholar and slightly fewer had at least one Mendeley reader.

Based on numerical comparisons, the Mendeley reader counts seem to be more useful for impact assessment purposes for dissertations that are less than two years old, whilst Google Scholar citations are more useful for older dissertations, especially in social sciences, arts and humanities.

Google Scholar citation counts may reflect a more scholarly type of impact than that of Mendeley reader counts because dissertations attract a substantial minority of their citations from other dissertations.

In summary, the new method now makes it possible for research funders, institutions and others to systematically evaluate the impact of dissertations, although additional Google Scholar queries for other online repositories are needed to ensure comprehensive coverage.

URL : https://arxiv.org/abs/1902.08746

A Reflection on the Applicability of Google Scholar as a Tool for Comprehensive Retrieval in Bibliometric Research and Systematic Reviews

Authors : Mojgan Houshyar, Hajar Sotudeh

Google Scholar has recently attracted great attentions as an open access multidisciplinary citation database, and a tool for retrieving scientific works for scientometricians and researchers.

The present research intended to highlight the limitations brought about by efficiency policies of the search engine and its impact on the results available to users. To do so, it examined the accessibility of the retrieval results, through conducting 54 searches in this database.

The results showed that the estimation of the results on the top of the first page returned by Google Scholar did not match that of the accessible results. Therefore, these statistics could not be accounted for to precisely determine the number of documents on a topic.

Moreover, the results showed that although the subjects selected for the searches were very specific, the number of results for each search was very wide and exceeded the upper limit of 1,000 records authorized in Google Scholar for display.

By limiting the searches to the title field, the number of the results was dramatically reduced. Since title is one of the most important representations of document contents in scientific and technical fields, this strategy can increase the precision of the results and thus the effectiveness of the retrievals.

The investigation of the accessibility of the search results for the title field also showed that some documents, though scarce in number, were still inaccessible despite the fact that they were within the 1000-record limits.

In addition, in title field search, some rare cases of duplicate records, incompatibilities between queries and documents were observed regarding the language of the documents and exact phrase search.

The lack of automatic truncation in field searches was one of the most important issues necessitating the use of sophisticated search strategies.

URL : A Reflection on the Applicability of Google Scholar as a Tool for Comprehensive Retrieval in Bibliometric Research and Systematic Reviews

Alternative location : https://ijism.ricest.ac.ir/index.php/ijism/article/view/1267

Google Scholar as a data source for research assessment

Authors : Emilio Delgado López-Cózar, Enrique Orduna-Malea, Alberto Martín-Martín

The launch of Google Scholar (GS) marked the beginning of a revolution in the scientific information market. This search engine, unlike traditional databases, automatically indexes information from the academic web. Its ease of use, together with its wide coverage and fast indexing speed, have made it the first tool most scientists currently turn to when they need to carry out a literature search.

Additionally, the fact that its search results were accompanied from the beginning by citation counts, as well as the later development of secondary products which leverage this citation data (such as Google Scholar Metrics and Google Scholar Citations), made many scientists wonder about its potential as a source of data for bibliometric analyses.

The goal of this chapter is to lay the foundations for the use of GS as a supplementary source (and in some disciplines, arguably the best alternative) for scientific evaluation.

First, we present a general overview of how GS works. Second, we present empirical evidences about its main characteristics (size, coverage, and growth rate). Third, we carry out a systematic analysis of the main limitations this search engine presents as a tool for the evaluation of scientific performance.

Lastly, we discuss the main differences between GS and other more traditional bibliographic databases in light of the correlations found between their citation data. We conclude that Google Scholar presents a broader view of the academic world because it has brought to light a great amount of sources that were not previously visible.

URL : https://arxiv.org/abs/1806.04435