Evaluating the predictive capacity of ChatGPT for academic peer review outcomes across multiple platforms

Authors : Mike Thelwall, Abdallah Yaghi

Academic peer review is at the heart of scientific quality control, yet the process is slow and time-consuming. Technology that can predict peer review outcomes may help with this, for example by fast-tracking desk rejection decisions. While previous studies have demonstrated that Large Language Models (LLMs) can predict peer review outcomes to some extent, this paper introduces two new contexts and employs a more robust method—averaging multiple ChatGPT scores.

Averaging 30 ChatGPT predictions, based on reviewer guidelines and using only the submitted titles and abstracts, failed to predict peer review outcomes for F1000Research (Spearman’s rho = 0.00). However, it produced mostly weak positive correlations with the quality dimensions of SciPost Physics (rho = 0.25 for validity, rho = 0.25 for originality, rho = 0.20 for significance, and rho = 0.08 for clarity) and a moderate positive correlation for papers from the International Conference on Learning Representations (ICLR) (rho = 0.38). Including article full texts increased the correlation for ICLR (rho = 0.46) and slightly improved it for F1000Research (rho = 0.09), with variable effects on the four quality dimension correlations when SciPost LaTeX full texts were used.

The use of simple chain-of-thought system prompts slightly increased the correlation for F1000Research (rho = 0.10), marginally reduced it for ICLR (rho = 0.37), and decreased the SciPost Physics correlations (rho = 0.16 for validity, rho = 0.18 for originality, rho = 0.18 for significance, and rho = 0.05 for clarity). Overall, the results suggest that in some contexts, ChatGPT can produce weak pre-publication quality predictions.

However, the effectiveness of such predictions and the optimal strategies for obtaining them vary considerably between platforms, journals, and conferences. Finally, the most suitable inputs for ChatGPT appear to differ depending on the platform.
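
To make the score-averaging approach concrete, here is a minimal Python sketch that repeatedly asks an LLM for a quality score on a title and abstract, averages the scores, and correlates the averages with peer review outcomes using Spearman's rho. The prompt wording, model name, and 1-10 scale are illustrative assumptions, not the paper's exact configuration (the study used reviewer guidelines as its prompt basis).

```python
# Minimal sketch of the score-averaging idea: ask an LLM for a quality score
# several times, average the scores, then rank-correlate the averages with
# peer review outcomes. Prompt wording, model name and the 1-10 scale are
# assumptions for illustration, not the paper's exact configuration.
import re
from statistics import mean

from openai import OpenAI          # assumes the openai>=1.0 Python client
from scipy.stats import spearmanr

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_once(title: str, abstract: str, model: str = "gpt-4o-mini") -> float:
    """Request a single 1-10 quality score for a title and abstract."""
    prompt = (
        "Acting as a journal reviewer, rate the likely quality of this submission "
        "from 1 (reject) to 10 (outstanding). Reply with a number only.\n\n"
        f"Title: {title}\n\nAbstract: {abstract}"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    match = re.search(r"\d+(\.\d+)?", resp.choices[0].message.content)
    return float(match.group()) if match else float("nan")

def averaged_score(title: str, abstract: str, n: int = 30) -> float:
    """Average n independent scores to smooth out run-to-run randomness."""
    return mean(score_once(title, abstract) for _ in range(n))

def correlate(papers, human_outcomes):
    """papers: list of (title, abstract); human_outcomes: matching review scores."""
    predictions = [averaged_score(title, abstract) for title, abstract in papers]
    rho, p_value = spearmanr(predictions, human_outcomes)
    return rho, p_value
```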

DOI : https://doi.org/10.1007/s11192-025-05287-1

Improving peer review of systematic reviews and related review types by involving librarians and information specialists as methodological peer reviewers: a randomised controlled trial

Authors : Melissa L Rethlefsen, Sara Schroter, Lex M Bouter, Jamie J Kirkham, David Moher, Ana Patricia Ayala, David Blanco, Tara J Brigham, Holly K Grossetta Nardini, Shona Kirtley, Kate Nyhan, Whitney Townsend, Maurice Zeegers

Objective

To evaluate the impact of adding librarians and information specialists (LIS) as methodological peer reviewers to the formal journal peer review process on the quality of search reporting and risk of bias in systematic review searches in the medical literature.

Design

Pragmatic two-group parallel randomised controlled trial.

Setting

Three biomedical journals.

Participants

Systematic reviews and related evidence synthesis manuscripts submitted to The BMJ, BMJ Open and BMJ Medicine and sent out for peer review from 3 January 2023 to 1 September 2023. Randomisation (allocation ratio, 1:1) was stratified by journal and used permuted blocks (block size=4). Of the 2670 manuscripts sent to peer review during study enrolment, 400 met inclusion criteria and were randomised (62 The BMJ, 334 BMJ Open, 4 BMJ Medicine). By 2 January 2024, 76 manuscripts in the intervention group and 90 in the control group had been revised and resubmitted.
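
For readers unfamiliar with this allocation design, the sketch below shows one generic way to implement permuted-block randomisation with a 1:1 ratio, block size 4, stratified by journal. It is an illustration of the technique, not the trial's actual allocation code.

```python
# Illustrative stratified permuted-block randomisation (1:1 allocation, block
# size 4, stratified by journal). Not the trial's actual allocation code.
import random
from collections import defaultdict

BLOCK = ["intervention", "intervention", "control", "control"]  # 1:1 within each block of 4

class Randomiser:
    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.queues = defaultdict(list)  # one queue of pending assignments per stratum

    def assign(self, journal: str) -> str:
        queue = self.queues[journal]
        if not queue:                    # start a freshly shuffled block for this stratum
            block = BLOCK[:]
            self.rng.shuffle(block)
            queue.extend(block)
        return queue.pop(0)

randomiser = Randomiser(seed=42)
for manuscript, journal in [("m1", "BMJ Open"), ("m2", "The BMJ"), ("m3", "BMJ Open")]:
    print(manuscript, journal, randomiser.assign(journal))
```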

Interventions

All manuscripts followed usual journal practice for peer review, but those in the intervention group had an additional LIS peer reviewer invited.

Main outcome measures

The primary outcomes were the differences between the intervention and control groups in the quality of reporting and the risk of bias in first-revision manuscripts. Quality of reporting was measured using four prespecified PRISMA-S items. Risk of bias was measured using ROBIS Domain 2. Assessments were done in duplicate and assessors were blinded to group allocation. Secondary outcomes included differences between groups for each individual PRISMA-S and ROBIS Domain 2 item. The difference between groups in the proportion of manuscripts rejected at the first decision after peer review was an additional outcome.

Results

Neither the proportion of adequately reported searches (4.4% difference, 95% CI: −2.0% to 10.7%) nor the risk of bias in searches (0.5% difference, 95% CI: −13.7% to 14.6%) differed significantly between groups. By 4 months post-study, 98 intervention group and 70 control group manuscripts had been rejected after peer review (13.8% difference, 95% CI: 3.9% to 23.8%).
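
The figures above are differences between two proportions with 95% confidence intervals. A simple Wald-style interval, one common choice and not necessarily the method used in the trial, can be computed as follows; the counts in the example are hypothetical.

```python
# Wald-style 95% confidence interval for a difference between two proportions.
# One common formulation; the trial's exact interval method is not stated in
# the abstract, and the counts below are hypothetical.
from math import sqrt

def diff_proportions_ci(x1, n1, x2, n2, z=1.96):
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = diff_proportions_ci(40, 76, 32, 90)   # made-up counts
print(f"difference = {diff:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```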

Conclusions

Inviting LIS peer reviewers did not affect the adequacy of search reporting or the risk of bias of searches in first-revision manuscripts of biomedical systematic reviews and related review types, though LIS peer reviewers may have contributed to a higher rate of rejection after peer review.

DOI : https://doi.org/10.1136/bmjebm-2024-113527

Enhancing Research Methodology and Academic Publishing: A Structured Framework for Quality and Integrity

Authors : Md. Jalil Piran, Nguyen H. Tran

Following a brief introduction to research, research processes, research types, papers, reviews, and evaluations, this paper presents a structured framework for addressing inconsistencies in research methodology, technical writing, quality assessment, and publication standards across academic disciplines. Using a four-dimensional evaluation model that focuses on 1) technical content, 2) structural coherence, 3) writing precision, and 4) ethical integrity, the framework not only standardizes review and publication processes but also serves as a practical guide for authors preparing high-quality manuscripts. None of the four dimensions can be compromised for the sake of another.

We then discuss in detail the components of a research paper under the four-dimensional evaluation model, providing guidelines and principles for each. By aligning manuscripts with journal standards, reducing review bias, and enhancing transparency, the framework contributes to more reliable and reproducible research results. Moreover, by strengthening cross-disciplinary credibility, improving publication consistency, and fostering public trust in academic literature, this initiative is expected to positively influence both research quality and the reputation of scholarly publishing.

Arxiv : https://arxiv.org/abs/2412.05683

Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT’s Effectiveness with Different Settings and Inputs

Author : Mike Thelwall

Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments, and promotions. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process.

This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts.

The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest yet reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66).

The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones.

Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.
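
A minimal sketch of the final calibration step, fitting a linear regression that maps averaged ChatGPT scores onto the human scale, might look like the following; the scores shown are hypothetical and the paper's exact regression setup is not reproduced here.

```python
# Sketch of calibrating averaged LLM scores onto the human scale with ordinary
# least squares. The scores are hypothetical and the paper's exact regression
# setup is not reproduced here.
import numpy as np

llm_scores = np.array([6.1, 7.4, 5.2, 8.0, 6.8])    # hypothetical averaged ChatGPT scores
human_scores = np.array([2.0, 3.0, 1.0, 4.0, 3.0])  # hypothetical human quality ratings

slope, intercept = np.polyfit(llm_scores, human_scores, deg=1)  # fit y = a*x + b
calibrated = slope * llm_scores + intercept
print(np.round(calibrated, 2))
```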

Arxiv : https://arxiv.org/abs/2408.06752

Fundamental problems in the peer-review process and stakeholders’ perceptions of potential suggestions for improvement

Authors : Cigdem Kadaifci, Erkan Isikli, Y. Ilker Topcu

Academic papers are essential for researchers to communicate their work to their peers and industry experts. Quality research is published in prestigious scientific journals and is considered part of the hiring and promotion criteria at leading universities. Scientific journals conduct impartial and anonymous peer reviews of submitted manuscripts; however, individuals involved in this process may encounter issues related to the duration, impartiality, and transparency of these reviews.

To explore these concerns, we created a questionnaire based on a comprehensive review of related literature and expert opinions, which was distributed to all stakeholders (authors, reviewers, and editors) who participated in the peer-review process from a variety of countries and disciplines. Their opinions on the primary issues during the process and suggestions for improvement were collected. The data were then analysed based on various groups, such as gender, country of residence, and contribution type, using appropriate multivariate statistical techniques to determine the perceptions and experiences of participants in the peer-review process.

The results showed that unethical behaviour was not uncommon and that editors and experienced reviewers encountered it more frequently. Women and academics from Türkiye were more likely to experience ethical violations and perceived them as more ethically severe. Incentives and stakeholder involvement were seen as ways to enhance the quality and impartiality of peer review. The scale developed can serve as a useful tool for addressing difficulties in the peer-review process and improving its effectiveness and performance.

DOI : https://doi.org/10.1002/leap.1637

Peer Reviews of Peer Reviews: A Randomized Controlled Trial and Other Experiments

Authors : Alexander Goldberg, Ivan Stelmakh, Kyunghyun Cho, Alice Oh, Alekh Agarwal, Danielle Belgrave, Nihar B. Shah

Is it possible to reliably evaluate the quality of peer reviews? We study this question driven by two primary motivations: incentivizing high-quality reviewing based on assessed review quality, and measuring changes in review quality in experiments. We conduct a large-scale study at the NeurIPS 2022 conference, a top-tier conference in machine learning, in which we invited (meta-)reviewers and authors to evaluate the reviews given to submitted papers.

First, we conduct an RCT to examine bias due to the length of reviews. We generate elongated versions of reviews by adding substantial amounts of non-informative content. Participants in the control group evaluate the original reviews, whereas participants in the experimental group evaluate the artificially lengthened versions.

We find that the lengthened reviews are scored as statistically significantly higher quality than the original reviews. In an analysis of observational data, we find that authors are positively biased towards reviews recommending acceptance of their own papers, even after controlling for confounders such as review length, review quality, and the number of papers per author.

We also measure disagreement rates of 28%-32% between multiple evaluations of the same review, comparable to the disagreement rates between paper reviewers at NeurIPS. Further, we assess the miscalibration of evaluators of reviews using a linear model of quality scores and find that it is similar to estimates of the miscalibration of paper reviewers at NeurIPS.

Finally, we estimate the amount of variability in subjective opinions around how to map individual criteria to overall scores of review quality and find that it is roughly the same as that in the review of papers. Our results suggest that the various problems that exist in reviews of papers — inconsistency, bias towards irrelevant factors, miscalibration, subjectivity — also arise in reviewing of reviews.
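
The miscalibration analysis mentioned above relies on a linear model of quality scores. A toy version, in which each observed score is modelled as the sum of a per-review quality term and a per-evaluator bias term fitted by least squares, is sketched below; the data and model details are illustrative assumptions rather than the authors' specification.

```python
# Toy linear model for evaluator miscalibration: each observed score is modelled
# as (quality of the review being evaluated) + (bias of the evaluator) + noise,
# fitted by least squares. Data and model details are illustrative assumptions.
import numpy as np

# (evaluator index, review index, observed score) triples -- hypothetical data
observations = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 5.0), (1, 2, 4.0), (2, 1, 2.0), (2, 2, 3.0)]
n_evaluators, n_reviews = 3, 3

# Design matrix: one indicator column per review quality, one per evaluator bias
X = np.zeros((len(observations), n_reviews + n_evaluators))
y = np.zeros(len(observations))
for row, (evaluator, review, score) in enumerate(observations):
    X[row, review] = 1.0                 # quality term for this review
    X[row, n_reviews + evaluator] = 1.0  # bias term for this evaluator
    y[row] = score

params, *_ = np.linalg.lstsq(X, y, rcond=None)
quality, bias = params[:n_reviews], params[n_reviews:]
print("estimated evaluator biases:", np.round(bias - bias.mean(), 2))  # centred, since only differences are identified
```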

Arxiv : https://arxiv.org/abs/2311.09497

Enhancing peer review efficiency: A mixed-methods analysis of artificial intelligence-assisted reviewer selection across academic disciplines

Author : Shai Farber

This mixed-methods study evaluates the efficacy of artificial intelligence (AI)-assisted reviewer selection in academic publishing across diverse disciplines. Twenty journal editors assessed AI-generated reviewer recommendations for a manuscript. The AI system achieved a 42% overlap with editors’ selections and demonstrated a significant improvement in time efficiency, reducing selection time by 73%.

Editors found that 37% of AI-suggested reviewers who were not part of their initial selection were indeed suitable. The system’s performance varied across disciplines, with higher accuracy in STEM fields (Cohen’s d = 0.68). Qualitative feedback revealed an appreciation for the AI’s ability to identify lesser-known experts but concerns about its grasp of interdisciplinary work. Ethical considerations, including potential algorithmic bias and privacy issues, were highlighted.

The study concludes that while AI shows promise in enhancing reviewer selection efficiency and broadening the reviewer pool, it requires human oversight to address limitations in understanding nuanced disciplinary contexts. Future research should focus on larger-scale longitudinal studies and developing ethical frameworks for AI integration in peer-review processes.
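
Cohen's d, quoted above for the STEM comparison, is a standardised mean difference. One common pooled-standard-deviation formulation is sketched below; the study's exact calculation is not described in the abstract, and the accuracy values are made up for illustration.

```python
# Cohen's d as a pooled-standard-deviation standardised mean difference. One
# common formulation; the study's exact calculation is not described in the
# abstract, and the accuracy values below are made up.
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

stem = [0.70, 0.80, 0.75, 0.90]      # hypothetical reviewer-match accuracy, STEM journals
non_stem = [0.60, 0.65, 0.70, 0.62]  # hypothetical accuracy, non-STEM journals
print(round(cohens_d(stem, non_stem), 2))
```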

DOI : https://doi.org/10.1002/leap.1638