Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT’s Effectiveness with Different Settings and Inputs

Author : Mike Thelwall

Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments, and promotion. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process.

This article assesses which ChatGPT input (full text without tables, figures, and references; title and abstract; title only) produces the best quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts.

The results show that the optimal input is the article title and abstract: average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlate at 0.67 with human scores, the highest correlation yet reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66).

The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones.

Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers to originality and significance. Finally, linear regression can be used to convert the model scores onto the human scale, giving estimates that are 31% more accurate than guessing.
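As a rough illustration of these last two steps, the sketch below (synthetic placeholder data, not the paper’s dataset; the 1-4 human quality scale is an assumption) averages repeated ChatGPT scores per article and then maps them onto the human scale with ordinary least squares:

```python
# Minimal sketch: average repeated LLM scores per article, then map
# them onto the human scale with linear regression.
# All data below are synthetic placeholders, not the paper's dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_papers, n_iters = 51, 30                          # sizes reported in the paper

human = rng.integers(1, 5, n_papers).astype(float)  # hypothetical 1-4 quality scores
# Simulated per-iteration ChatGPT scores: a noisy, shifted version of the human signal
llm_runs = 0.8 * human[:, None] + 1.0 + rng.normal(0.0, 1.2, (n_papers, n_iters))
llm_mean = llm_runs.mean(axis=1)                    # averaging 30 runs stabilises the score

print("correlation with human scores:", np.corrcoef(llm_mean, human)[0, 1])

# Convert model scores to the human scale via linear regression
reg = LinearRegression().fit(llm_mean.reshape(-1, 1), human)
human_scale_estimates = reg.predict(llm_mean.reshape(-1, 1))
```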

arXiv : https://arxiv.org/abs/2408.06752

Fundamental problems in the peer-review process and stakeholders’ perceptions of potential suggestions for improvement

Authors : Cigdem Kadaifci, Erkan Isikli, Y. Ilker Topcu

Academic papers are essential for researchers to communicate their work to their peers and to industry experts. Quality research is published in prestigious scientific journals and is considered part of the hiring and promotion criteria at leading universities. Scientific journals conduct impartial and anonymous peer reviews of submitted manuscripts; however, those involved in this process may encounter issues related to the duration, impartiality, and transparency of these reviews.

To explore these concerns, we created a questionnaire based on a comprehensive review of the related literature and expert opinions, and distributed it to stakeholders (authors, reviewers, and editors) from a variety of countries and disciplines who had participated in the peer-review process. Their opinions on the primary issues in the process and their suggestions for improvement were collected. The data were then analysed across groups such as gender, country of residence, and contribution type, using appropriate multivariate statistical techniques to determine participants’ perceptions and experiences of the peer-review process.

The results showed that unethical behaviour was not uncommon and that editors and experienced reviewers encountered it more frequently. Women and academics from Türkiye were more likely to experience ethical violations and perceived them as more ethically severe. Incentives and stakeholder involvement were seen as ways to enhance the quality and impartiality of peer review. The scale developed can serve as a useful tool for addressing difficulties in the peer-review process and improving its effectiveness and performance.

DOI : https://doi.org/10.1002/leap.1637

Peer Reviews of Peer Reviews: A Randomized Controlled Trial and Other Experiments

Authors : Alexander Goldberg, Ivan Stelmakh, Kyunghyun Cho, Alice Oh, Alekh Agarwal, Danielle Belgrave, Nihar B. Shah

Is it possible to reliably evaluate the quality of peer reviews? We study this question driven by two primary motivations: incentivizing high-quality reviewing using assessed review quality, and measuring changes in review quality in experiments. We conduct a large-scale study at the NeurIPS 2022 conference, a top-tier conference in machine learning, in which we invited (meta-)reviewers and authors to evaluate the reviews given to submitted papers.

First, we conduct a randomized controlled trial (RCT) to examine bias due to review length. We generate elongated versions of reviews by adding substantial amounts of non-informative content. Participants in the control group evaluate the original reviews, whereas participants in the experimental group evaluate the artificially lengthened versions.

We find that the lengthened reviews are rated (statistically significantly) higher in quality than the original reviews. In an analysis of observational data, we find that authors are positively biased towards reviews that recommend acceptance of their own papers, even after controlling for review length, review quality, and the number of papers per author.

We also measure disagreement rates of 28%-32% between multiple evaluations of the same review, comparable to the disagreement rates between paper reviewers at NeurIPS. Further, we assess the miscalibration of review evaluators using a linear model of quality scores and find that it is similar to estimates of the miscalibration of paper reviewers at NeurIPS.

Finally, we estimate the variability in subjective opinions about how to map individual criteria onto an overall review-quality score and find that it is roughly the same as in the review of papers. Our results suggest that the problems that exist in reviews of papers (inconsistency, bias towards irrelevant factors, miscalibration, subjectivity) also arise in the reviewing of reviews.
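As an illustration of the miscalibration analysis, the sketch below (synthetic data and a deliberately simplified design; the authors’ actual specification may differ) fits a linear model in which each observed score is the review’s latent quality plus an evaluator-specific offset:

```python
# Hedged sketch: estimate per-evaluator score offsets ("miscalibration")
# with a linear model. Synthetic data; not the authors' actual code.
import numpy as np

rng = np.random.default_rng(1)
n_reviews, n_evaluators = 40, 10
true_quality = rng.normal(3.0, 0.8, n_reviews)
evaluator_bias = rng.normal(0.0, 0.5, n_evaluators)

# Each review is scored by two distinct evaluators
rows = []
for r in range(n_reviews):
    for e in rng.choice(n_evaluators, size=2, replace=False):
        rows.append((r, e, true_quality[r] + evaluator_bias[e] + rng.normal(0.0, 0.3)))

# Design matrix: one dummy per review, plus evaluator dummies
# (evaluator 0 is dropped for identifiability)
X = np.zeros((len(rows), n_reviews + n_evaluators - 1))
y = np.empty(len(rows))
for i, (r, e, s) in enumerate(rows):
    X[i, r] = 1.0
    if e > 0:
        X[i, n_reviews + e - 1] = 1.0
    y[i] = s

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
bias_estimates = coef[n_reviews:]   # evaluator offsets relative to evaluator 0
print("estimated relative biases:", np.round(bias_estimates, 2))
```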

arXiv : https://arxiv.org/abs/2311.09497

Enhancing peer review efficiency: A mixed-methods analysis of artificial intelligence-assisted reviewer selection across academic disciplines

Author : Shai Farber

This mixed-methods study evaluates the efficacy of artificial intelligence (AI)-assisted reviewer selection in academic publishing across diverse disciplines. Twenty journal editors assessed AI-generated reviewer recommendations for a manuscript. The AI system achieved a 42% overlap with editors’ selections and demonstrated a significant improvement in time efficiency, reducing selection time by 73%.

Editors found that 37% of AI-suggested reviewers who were not part of their initial selection were indeed suitable. The system’s performance varied across disciplines, with higher accuracy in STEM fields (Cohen’s d = 0.68). Qualitative feedback revealed an appreciation for the AI’s ability to identify lesser-known experts but concerns about its grasp of interdisciplinary work. Ethical considerations, including potential algorithmic bias and privacy issues, were highlighted.

The study concludes that while AI shows promise in enhancing reviewer selection efficiency and broadening the reviewer pool, it requires human oversight to address limitations in understanding nuanced disciplinary contexts. Future research should focus on larger-scale longitudinal studies and developing ethical frameworks for AI integration in peer-review processes.

DOI : https://doi.org/10.1002/leap.1638

Status of peer review guidelines in international surgical journals: A cross-sectional survey

Authors : Min Dong, Wenjing Wang, Xuemei Liu, Fang Lei, Yunmei Luo

Aim

To gain insight into the current status of peer review guidelines in international surgical journals and to offer guidance for the development of peer review guidelines for surgical journals.

Methods

We selected the top 100 journals in the ‘Surgery’ category of the Journal Citation Reports 2021. We searched the websites of these journals, as well as Web of Science, PubMed, and other databases, to gather the peer review guidelines published by these 100 journals up to 30 June 2022. We then analysed the contents of these peer review guidelines.

Results

Only 52% (52/100) of the journals provided guidelines for reviewers. Sixteen peer review guidelines published by these 52 surgical journals were included in this study. The contents of these guidelines were classified into 33 items. The most common item was research methodology, mentioned by 13 journals (25%, 13/52). Other important items included statistical methodology, mentioned by 11 journals (21.2%); the rationality of figures, tables, and data, mentioned by 11 journals (21.2%); innovation of research, mentioned by nine journals (17.3%); and language expression, readability of papers, ethical review, references, and so forth, each mentioned by eight journals (15.4%).

Two journals described items for assessing the quality of peer review itself. Forty-three journals offered a checklist to guide reviewers in writing a review report. Some surgical journals developed peer review guidelines for reviewers at different academic levels, such as professional reviewers and patient/public reviewers. Additionally, some surgical journals provided specific items for different types of papers, such as original articles, reviews, surgical videos, surgical database research, surgery-related outcome measurements, and case reports.

Conclusions

Key contents of peer review guidelines for reviewers of surgical journals not only include items on research methodology, statistical methods, figures, tables and data, research innovation, and ethical review, but also cover items on reviewing surgical videos, surgical database research, and surgery-related outcome measurements, together with instructions on how to write a review report and guidance on how to assess the quality of peer review.

DOI : https://doi.org/10.1002/leap.1624

Evolution of Peer Review in Scientific Communication

Author : Dmitry Kochetkov

It is traditionally believed that peer review is the backbone of academic journals and scientific communication, ensuring high quality and trust in published materials. However, peer review only became an institutionalized practice in the second half of the 20th century, although the first scientific journals appeared three centuries earlier. By the beginning of the 21st century, the view had emerged that the traditional model of peer review is in deep crisis.

The aim of this article is to formulate a perspective model of peer review for scientific communication. The article discusses the evolution of the institution of scientific peer review and the formation of the current crisis. The author analyzed the modern landscape of innovations in peer review and scientific communication. Based on this analysis, three main peer review models were identified in relation to editorial workflow: pre-publication peer review (the traditional model), registered reports, and post-publication (peer) review (including preprint (peer) review).

The author argues that the third model offers the best way to implement the main functions of scientific communication.

DOI : https://doi.org/10.31235/osf.io/b2ra3

On The Peer Review Reports: Does Size Matter?

Authors : Abdelghani Maddi, Luis Miotti

Amidst the ever-expanding realm of scientific production and the proliferation of predatory journals, the focus on peer review remains paramount for scientometricians and sociologists of science. Despite this attention, there is a notable scarcity of empirical investigations into the tangible impact of peer review on publication quality.

This study aims to address this gap by conducting a comprehensive analysis of how peer review contributes to the quality of scholarly publications, as measured by the citations they receive. Utilizing an adjusted dataset of 57,482 publications from Publons matched to the Web of Science and employing the Raking Ratio method, our study reveals intriguing insights. Specifically, our findings shed light on a nuanced relationship between the length of reviewer reports and the citations publications subsequently receive.
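For readers unfamiliar with it, the Raking Ratio method is essentially survey reweighting by iterative proportional fitting: cell weights are adjusted until the weighted margins match known population margins. Below is a minimal sketch with a made-up 2x2 example; it illustrates the general technique, not the authors’ actual adjustment:

```python
# Minimal raking (iterative proportional fitting) sketch.
# All numbers are illustrative assumptions, not the study's data.
import numpy as np

# 2x2 sample counts, e.g. discipline x open-review status (made-up)
sample = np.array([[30.0, 10.0],
                   [20.0, 40.0]])
row_targets = np.array([55.0, 45.0])   # known population margins (rows)
col_targets = np.array([60.0, 40.0])   # known population margins (columns)

weighted = sample.copy()
for _ in range(100):                   # alternate until both margins match
    weighted *= (row_targets / weighted.sum(axis=1))[:, None]
    weighted *= (col_targets / weighted.sum(axis=0))[None, :]

print(np.round(weighted, 2))
print("row sums:", weighted.sum(axis=1), "col sums:", weighted.sum(axis=0))
```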

Through a robust regression analysis, we establish that, from 947 words onwards, the length of reviewer reports is significantly associated with an increase in citations. These results not only confirm the initial hypothesis that longer reports indicate requested improvements, thereby enhancing the quality and visibility of articles, but also underscore the importance of timely and comprehensive reviewer reports.
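To make the threshold finding concrete, here is a hedged sketch on synthetic data; the hinge specification and coefficients are assumptions for illustration, not necessarily the authors’ exact model:

```python
# Hedged sketch: regress (log) citations on report length beyond the
# 947-word cut-off reported in the paper. Synthetic data throughout.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 5_000
words = rng.integers(100, 3000, n).astype(float)

# Synthetic outcome: log-citations flat below the cut-off, rising past it
hinge = np.maximum(words - 947.0, 0.0)
log_citations = 1.0 + 0.0004 * hinge + rng.normal(0.0, 0.5, n)

model = LinearRegression().fit(hinge.reshape(-1, 1), log_citations)
print("estimated slope beyond 947 words:", model.coef_[0])
```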

Furthermore, insights from Publons’ data suggest that open access to reports can influence reviewer behavior, encouraging more detailed reports. Beyond the scholarly landscape, our findings prompt a reevaluation of the role of reviewers, emphasizing the need to recognize and value this resource-intensive yet underappreciated activity in institutional evaluations.

Additionally, the study sounds a cautionary note about the challenges peer review faces as submission volumes increase, which may compromise the vigilance with which peers can assess large numbers of articles.

HAL : https://cnrs.hal.science/hal-04492274