Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT’s Effectiveness with Different Settings and Inputs

Author : Mike Thelwall

Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process.

This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts.

The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest ever reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66).

The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones.

Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.
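The averaging and regression steps described above can be sketched as follows. This is a minimal illustration rather than the author's code: it assumes NumPy, the function names and toy numbers are hypothetical, and the "guessing" baseline is assumed here to mean always predicting the mean human score, which the abstract does not specify.

```python
import numpy as np

def mean_scores(score_matrix):
    """Average repeated model scores per paper (rows = papers, columns = iterations)."""
    return np.asarray(score_matrix, dtype=float).mean(axis=1)

def fit_linear_map(model_means, human_scores):
    """Least-squares line: human score ~ slope * model mean + intercept."""
    slope, intercept = np.polyfit(model_means, human_scores, deg=1)
    return slope, intercept

def mad_improvement(predicted, human_scores):
    """Fractional reduction in mean absolute deviation relative to a baseline
    that always guesses the mean human score (an assumed baseline)."""
    human = np.asarray(human_scores, dtype=float)
    baseline = np.abs(human - human.mean()).mean()
    model = np.abs(human - np.asarray(predicted, dtype=float)).mean()
    return 1.0 - model / baseline

# Hypothetical toy data: 4 papers, 3 model iterations each.
toy_model_scores = [[2, 3, 2], [3, 3, 3], [3, 4, 4], [4, 4, 4]]
toy_human_scores = [1, 2, 3, 4]

means = mean_scores(toy_model_scores)
slope, intercept = fit_linear_map(means, toy_human_scores)
correlation = np.corrcoef(means, toy_human_scores)[0, 1]
improvement = mad_improvement(slope * means + intercept, toy_human_scores)
```

Averaging over iterations before correlating reflects the abstract's use of mean ChatGPT scores; the regression then only rescales those means onto the human scoring scale.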

Arxiv : https://arxiv.org/abs/2408.06752

In which fields are citations indicators of research quality?

Authors : Mike Thelwall, Kayvan Kousha, Emma Stuart, Meiko Makita, Mahshid Abdoli, Paul Wilson, Jonathan Levitt

Citation counts are widely used as indicators of research quality to support or replace human peer review and for lists of top cited papers, researchers, and institutions. Nevertheless, the relationship between citations and research quality is poorly evidenced. We report the first large-scale science-wide academic evaluation of the relationship between research quality and citations (field normalized citation counts), correlating them for 87,739 journal articles in 34 field-based UK Units of Assessment (UoA).

The two correlate positively in all academic fields, from very weak (0.1) to strong (0.5), reflecting broadly linear relationships in all fields. We give the first evidence that the correlations are positive even across the arts and humanities. The patterns are similar for the field classification schemes of Scopus and Dimensions.ai, although varying for some individual subjects and therefore more uncertain for these.

We also show for the first time that no field has a citation threshold beyond which all articles are excellent quality, so lists of top cited articles are not pure collections of excellence, and neither is any top citation percentile indicator. Thus, while appropriately field normalized citations associate positively with research quality in all fields, they never perfectly reflect it, even at high values.
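As a rough illustration of the two ideas above, the sketch below field-normalizes citation counts and then checks what share of articles above a citation threshold received the top quality score. It is a sketch under stated assumptions: normalization here divides each count by the mean for the same field and year, which is one common variant and not necessarily the one used in the paper, and the function names, the `top_score=4` default and the toy data are hypothetical.

```python
from collections import defaultdict

def field_normalize(articles):
    """Divide each article's citations by the mean for its (field, year) group.

    articles: list of dicts with keys 'field', 'year', 'citations'.
    """
    totals = defaultdict(lambda: [0, 0])  # (field, year) -> [citation sum, article count]
    for a in articles:
        key = (a["field"], a["year"])
        totals[key][0] += a["citations"]
        totals[key][1] += 1
    normalized = []
    for a in articles:
        total, count = totals[(a["field"], a["year"])]
        mean = total / count
        normalized.append(a["citations"] / mean if mean > 0 else 0.0)
    return normalized

def share_top_quality_above(normalized, quality, threshold, top_score=4):
    """Share of articles at or above the threshold that have the top quality score.

    Returns None when no article reaches the threshold.
    """
    above = [q for n, q in zip(normalized, quality) if n >= threshold]
    if not above:
        return None
    return sum(1 for q in above if q == top_score) / len(above)

# Hypothetical toy data.
toy_articles = [
    {"field": "A", "year": 2015, "citations": 10},
    {"field": "A", "year": 2015, "citations": 30},
    {"field": "B", "year": 2015, "citations": 5},
]
toy_quality = [3, 4, 4]
toy_normalized = field_normalize(toy_articles)
```

A share below 1.0 at every threshold would correspond to the abstract's finding that no citation level guarantees excellence.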

DOI : https://doi.org/10.1002/asi.24767

Do altmetric scores reflect article quality? Evidence from the UK Research Excellence Framework 2021

Authors : Mike Thelwall, Kayvan Kousha, Mahshid Abdoli, Emma Stuart, Meiko Makita, Paul Wilson, Jonathan Levitt

Altmetrics are web-based quantitative impact or attention indicators for academic articles that have been proposed to supplement citation counts. This article reports the first assessment of the extent to which mature altmetrics from Altmetric.com and Mendeley associate with individual article quality scores.

It exploits expert norm-referenced peer review scores from the UK Research Excellence Framework 2021 for 67,030+ journal articles in all fields 2014–2017/2018, split into 34 broadly field-based Units of Assessment (UoAs). Altmetrics correlated more strongly with research quality than previously found, although less strongly than raw and field normalized Scopus citation counts.

Surprisingly, field normalizing citation counts can reduce their strength as a quality indicator for articles in a single field. For most UoAs, Mendeley reader counts are the best altmetric (e.g., three Spearman correlations with quality scores above 0.5). Tweet counts are also a moderate strength indicator in eight UoAs (Spearman correlations with quality scores above 0.3), ahead of news (eight correlations above 0.3, but generally weaker), blogs (five correlations above 0.3), and Facebook (three correlations above 0.3) citations, at least in the United Kingdom.

In general, altmetrics are the strongest indicators of research quality in the health and physical sciences and weakest in the arts and humanities.
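The Spearman correlations reported above are rank correlations: each variable is replaced by its rank (ties share the average rank) and a Pearson correlation is computed on the ranks. A minimal pure-Python sketch follows; the function names and toy values are hypothetical and this is not the authors' code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: reader counts and quality scores for five articles.
toy_readers = [3, 50, 12, 7, 120]
toy_quality = [2, 3, 3, 1, 4]
toy_rho = spearman(toy_readers, toy_quality)
```

Because only ranks matter, a few very highly read or tweeted articles do not dominate the result, which suits skewed count data such as altmetrics.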

Original location : https://asistdl.onlinelibrary.wiley.com/doi/full/10.1002/asi.24751

Researchers’ attitudes towards the h-index on Twitter 2007–2020: criticism and acceptance

Authors : Mike Thelwall, Kayvan Kousha

The h-index is an indicator of the scientific impact of an academic publishing career. Its hybrid publishing/citation nature and inherent bias against younger researchers, women, people in low resourced countries, and those not prioritizing publishing arguably give it little value for most formal and informal research evaluations.
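For readers unfamiliar with the indicator, the standard definition is that a researcher has h-index h when h of their publications have at least h citations each. A minimal sketch (the function name and example counts are illustrative only) shows why it mixes publication and citation counts:

```python
def h_index(citations):
    """Largest h such that at least h publications have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Two hypothetical careers with the same h-index but very different records.
steady_record = [10, 8, 5, 4, 3]
one_hit_record = [250, 8, 5, 4, 1]
```

Both hypothetical records give an h-index of 4, because h can never exceed the number of publications and ignores how far above h the top papers are cited; this is part of the reason it disadvantages researchers with fewer papers.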

Nevertheless, it is well-known by academics, used in some promotion decisions, and is prominent in bibliometric databases, such as Google Scholar. In the context of this apparent conflict, it is important to understand researchers’ attitudes towards the h-index.

This article used public tweets in English to analyse how scholars discuss the h-index in public: is it mentioned, are tweets about it positive or negative, and has interest decreased since its shortcomings were exposed?

The January 2021 Twitter Academic Research initiative was harnessed to download all English tweets mentioning the h-index from the 2006 start of Twitter until the end of 2020. The results showed a constantly increasing number of tweets.

Whilst the most popular tweets unapologetically used the h-index as an indicator of research performance, 28.5% of tweets were critical of its simplistic nature and others joked about it (8%). The results suggest that interest in the h-index is still increasing online despite scientists willing to evaluate the h-index in public tending to be critical.

Nevertheless, in limited situations it may be effective at succinctly conveying the message that a researcher has had a successful publishing career.

DOI : https://doi.org/10.1007/s11192-021-03961-8

Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: a multidisciplinary comparison of coverage via citations

Authors : Alberto Martín-Martín, Mike Thelwall, Enrique Orduna-Malea, Emilio Delgado López-Cózar

New sources of citation data have recently become available, such as Microsoft Academic, Dimensions, and the OpenCitations Index of CrossRef open DOI-to-DOI citations (COCI). Although these have been compared to the Web of Science Core Collection (WoS), Scopus, or Google Scholar, there is no systematic evidence of their differences across subject categories.

In response, this paper investigates 3,073,351 citations found by these six data sources to 2,515 English-language highly-cited documents published in 2006 from 252 subject categories, expanding and updating the largest previous study. Google Scholar found 88% of all citations, many of which were not found by the other sources, and nearly all citations found by the remaining sources (89–94%).

A similar pattern held within most subject categories. Microsoft Academic is the second largest overall (60% of all citations), including 82% of Scopus citations and 86% of WoS citations. In most categories, Microsoft Academic found more citations than Scopus and WoS (182 and 223 subject categories, respectively), but had coverage gaps in some areas, such as Physics and some Humanities categories. After Scopus, Dimensions is fourth largest (54% of all citations), including 84% of Scopus citations and 88% of WoS citations.

It found more citations than Scopus in 36 categories, more than WoS in 185, and displays some coverage gaps, especially in the Humanities. Following WoS, COCI is the smallest, with 28% of all citations. Google Scholar is still the most comprehensive source. In many subject categories Microsoft Academic and Dimensions are good alternatives to Scopus and WoS in terms of coverage.
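The two kinds of percentage quoted above (a source's share of all citations found by any source, and the share of one source's citations that another source also finds) can be sketched with set operations. This is an illustrative sketch only: the function name, the citation identifiers and the tiny sample are hypothetical, and it assumes citations have already been matched across sources.

```python
def coverage_report(found_by):
    """Compute overall and pairwise coverage.

    found_by: dict mapping source name -> set of matched citation ids.
    Returns (overall, pairwise) where
      overall[s]       = share of all citations (union of sources) found by s
      pairwise[(a, b)] = share of b's citations that a also found
    """
    all_citations = set().union(*found_by.values())
    if not all_citations:
        return {}, {}
    overall = {s: len(c) / len(all_citations) for s, c in found_by.items()}
    pairwise = {}
    for a, cites_a in found_by.items():
        for b, cites_b in found_by.items():
            if a != b and cites_b:
                pairwise[(a, b)] = len(cites_a & cites_b) / len(cites_b)
    return overall, pairwise

# Hypothetical toy sample of matched citation ids.
toy_sources = {
    "Google Scholar": {1, 2, 3, 4},
    "WoS": {2, 3},
    "COCI": {3, 5},
}
toy_overall, toy_pairwise = coverage_report(toy_sources)
```

In this toy sample, Google Scholar finds 80% of all citations and 100% of the WoS citations, mirroring the form (not the values) of the figures reported in the abstract.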

DOI : https://doi.org/10.1007/s11192-020-03690-4

How common are explicit research questions in journal articles?

Authors : Mike Thelwall, Amalia Mas-Bleda

Although explicitly labeled research questions seem to be central to some fields, others do not need them.

This may confuse authors, editors, readers, and reviewers of multidisciplinary research. This article assesses the extent to which research questions are explicitly mentioned in 17 out of 22 areas of scholarship from 2000 to 2018 by searching over a million full-text open access journal articles. Research questions were almost never explicitly mentioned (under 2%) by articles in engineering and physical, life, and medical sciences, and were the exception (always under 20%) for the broad fields in which they were least rare: computing, philosophy, theology, and social sciences. Nevertheless, research questions were increasingly mentioned explicitly in all fields investigated, despite a rate of 1.8% overall (1.1% after correcting for irrelevant matches).

Other terminology for an article’s purpose may be more widely used instead, including aims, objectives, goals, hypotheses, and purposes, although no terminology occurs in a majority of articles in any broad field tested. Authors, editors, readers, and reviewers should therefore be aware that the use of explicitly labeled research questions or other explicit research purpose terminology is non-standard in most or all broad fields, although it is becoming less rare.
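A full-text search of the kind described above can be approximated with a regular expression. The sketch below is an assumption-laden illustration: the pattern, function name and sample texts are hypothetical, and the queries actually used in the article are not given in the abstract.

```python
import re

# Hypothetical pattern: "research question" or "research questions", any case,
# allowing arbitrary whitespace (including line breaks) between the words.
RQ_PATTERN = re.compile(r"\bresearch\s+questions?\b", re.IGNORECASE)

def share_mentioning(texts, pattern=RQ_PATTERN):
    """Share of documents whose text matches the pattern at least once."""
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if pattern.search(text))
    return hits / len(texts)

# Hypothetical toy corpus of article full texts.
toy_texts = [
    "Our research question is whether X affects Y.",
    "This paper aims to measure Z.",
    "Three Research  Questions guide this study.",
    "We test two hypotheses.",
]
toy_share = share_mentioning(toy_texts)
```

A raw match rate like this overcounts, since a phrase such as "this remains an open research question" matches without the article stating its own questions, which is presumably why the abstract reports a lower rate after correcting for irrelevant matches. Swapping in patterns for aims, objectives, goals, hypotheses or purposes would follow the same shape.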

Original location : https://www.mitpressjournals.org/doi/abs/10.1162/qss_a_00041?af=R&

Does the use of open, non-anonymous peer review in scholarly publishing introduce bias? Evidence from the F1000 post-publication open peer review publishing model

Authors : Mike Thelwall, Verena Weigert, Liz Allen, Zena Nyakoojo, Eleanor-Rose Papas

This study examines whether there is any evidence of bias in two areas of common critique of open, non-anonymous peer review, as used in the post-publication peer review system operated by the open-access scholarly publishing platform F1000Research.

First, is there evidence of bias where a reviewer based in a specific country assesses the work of an author also based in the same country? Second, are reviewers influenced by being able to see the comments and know the origins of previous reviewers?

Scrutinising the open peer review comments published on F1000Research, we assess the extent of two frequently cited potential influences on reviewers that may be the result of the transparency offered by a fully attributable, open peer review publishing model: the national affiliations of authors and reviewers, and the ability of reviewers to view previously-published reviewer reports before submitting their own.

The effects of these potential influences were investigated for all first versions of articles published by 8 July 2019 to F1000Research. In 16 out of the 20 countries with the most articles, there was a tendency for reviewers based in the same country to give a more positive review.

The difference was statistically significant in one. Only 3 countries had the reverse tendency. Second, there is no evidence of a conformity bias. When reviewers mentioned a previous review in their peer review report, they were not more likely to give the same overall judgement.

Although reviewers who had longer to potentially read previously published reviewer reports were slightly less likely to agree with previous reviewer judgements, this could be due to these articles being difficult to judge rather than deliberate non-conformity.

URL : https://arxiv.org/abs/1911.03379