Evaluating the predictive capacity of ChatGPT for academic peer review outcomes across multiple platforms

Authors : Mike Thelwall, Abdallah Yaghi

Academic peer review is at the heart of scientific quality control, yet the process is slow and resource-intensive. Technology that can predict peer review outcomes may help with this, for example by fast-tracking desk rejection decisions. While previous studies have demonstrated that Large Language Models (LLMs) can predict peer review outcomes to some extent, this paper introduces two new contexts and employs a more robust method: averaging multiple ChatGPT scores.

Averaging 30 ChatGPT predictions, based on reviewer guidelines and using only the submitted titles and abstracts, failed to predict peer review outcomes for F1000Research (Spearman’s rho = 0.00). However, it produced mostly weak positive correlations with the quality dimensions of SciPost Physics (rho = 0.25 for validity, rho = 0.25 for originality, rho = 0.20 for significance, and rho = 0.08 for clarity) and a moderate positive correlation for papers from the International Conference on Learning Representations (ICLR) (rho = 0.38). Including article full texts increased the correlation for ICLR (rho = 0.46) and slightly improved it for F1000Research (rho = 0.09), with variable effects on the four quality dimension correlations for SciPost LaTeX files.

The use of simple chain-of-thought system prompts slightly increased the correlation for F1000Research (rho = 0.10), marginally reduced it for ICLR (rho = 0.37), and lowered all four dimension correlations for SciPost Physics (rho = 0.16 for validity, rho = 0.18 for originality, rho = 0.18 for significance, and rho = 0.05 for clarity). Overall, the results suggest that in some contexts, ChatGPT can produce weak pre-publication quality predictions.

However, the effectiveness of these predictions and the optimal strategies for producing them vary considerably between platforms, journals, and conferences. Finally, the most suitable inputs for ChatGPT appear to differ depending on the platform.
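The averaging procedure at the core of the paper is easy to sketch. The snippet below is a minimal illustration, not the authors' pipeline: it assumes the openai Python client, a hypothetical toy list of papers, and a made-up pair of system prompts (one plain, one chain-of-thought) to show where the prompt variants from the experiments would plug in.

```python
# Minimal sketch of score averaging: query ChatGPT repeatedly per paper,
# average the scores, then correlate the averages with review outcomes.
import re
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PLAIN_PROMPT = "You are an academic reviewer. Score this submission from 1 to 5."
COT_PROMPT = ("You are an academic reviewer. Think step by step about validity, "
              "originality, significance and clarity, then give a 1-5 score.")

def score_once(title: str, abstract: str, system: str) -> float:
    """Ask ChatGPT for one quality score, parsing the first number it returns."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user",
             "content": f"Title: {title}\nAbstract: {abstract}\n"
                        "Reply with only the score."},
        ],
    )
    match = re.search(r"\d+(\.\d+)?", resp.choices[0].message.content)
    return float(match.group()) if match else float("nan")

def mean_score(title: str, abstract: str, n: int = 30) -> float:
    """Average n independent scores to damp run-to-run randomness.
    Swap PLAIN_PROMPT for COT_PROMPT to test the chain-of-thought variant."""
    return sum(score_once(title, abstract, PLAIN_PROMPT) for _ in range(n)) / n

# Hypothetical toy data: (title, abstract, human review outcome).
papers = [
    ("Paper A", "We propose a new method for ...", 4.0),
    ("Paper B", "This study replicates ...", 2.0),
    ("Paper C", "A survey of recent advances in ...", 3.0),
]

predicted = [mean_score(t, a) for t, a, _ in papers]
human = [h for _, _, h in papers]
rho, p = spearmanr(predicted, human)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```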


DOI : https://doi.org/10.1007/s11192-025-05287-1

The Origins and Veracity of References ‘Cited’ by Generative Artificial Intelligence Applications: Implications for the Quality of Responses

Author : Dirk H. R. Spennemann

The public release of ChatGPT in late 2022 has resulted in considerable publicity and has led to widespread discussion of the usefulness and capabilities of generative Artificial Intelligence (AI) language models. Its ability to extract and summarise data from textual sources and present them as human-like contextual responses makes it an eminently suitable tool to answer questions users might ask.

Expanding on a previous analysis of the capabilities of ChatGPT3.5, this paper tested what archaeological literature appears to have been included in the training phase of three recent generative AI language models: ChatGPT4o, ScholarGPT, and DeepSeek R1. While ChatGPT3.5 offered seemingly pertinent references, a large percentage proved to be fictitious. The more recent ScholarGPT, which is purportedly tailored towards academic needs, performed much better, yet it still offered a high rate of fictitious references compared to the general models ChatGPT4o and DeepSeek.

Using ‘cloze’ analysis to make inferences about the sources ‘memorized’ by a generative AI model, this paper was unable to prove that any of the four genAI models had perused the full texts of the genuine references. It can be shown, however, that all references provided by ChatGPT, other OpenAI models, and DeepSeek that were found to be genuine have also been cited on Wikipedia pages.

This strongly indicates that the source base for at least some, if not most, of the data is found in those pages and thus represents, at best, third-hand source material. This has significant implications for the quality of the data available to generative AI models to shape their answers, and these implications are discussed.
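A cloze test of the kind referred to above can be run in a few lines. The sketch below is illustrative only: the passage, the masked word, and the model choice are assumptions, not the paper's materials.

```python
# Illustrative cloze test: mask a word from a passage taken from a
# candidate source and ask the model to fill the blank. A correct guess
# at a distinctive word hints that the passage was seen during training.
from openai import OpenAI

client = OpenAI()

passage = ("The excavation at the hypothetical Example Bay site revealed "
           "three distinct occupation layers dated by radiocarbon assay.")
masked_word = "radiocarbon"
cloze = passage.replace(masked_word, "_____")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": ("Fill in the blank with the single most likely word, "
                    "replying with that word only:\n" + cloze),
    }],
)
guess = resp.choices[0].message.content.strip().strip(".").lower()
print("exact recall:", guess == masked_word)
```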


DOI : https://doi.org/10.3390/publications13010012

‘As of my last knowledge update’: How is content generated by ChatGPT infiltrating scientific papers published in premier journals?

Author : Artur Strzelecki

The aim of this paper is to highlight the situation whereby content generated by the large language model ChatGPT is appearing in peer-reviewed papers in journals by recognized publishers. The paper demonstrates how to identify sections that indicate that a text fragment was generated, that is, entirely created by ChatGPT. To prepare an illustrative compilation of papers that appear in journals indexed in the Web of Science and Scopus databases and that possess Impact Factor and CiteScore indicators, the SPAR4SLR method, which is mainly applied in systematic literature reviews, was used.

Three main findings are presented: (1) articles bearing the hallmarks of content generated by AI large language models, whose use was not declared by the authors, appear in highly regarded premier journals; (2) many of these identified papers are already receiving citations from other scientific works, themselves published in journals indexed in scientific databases; and (3) most of the identified papers belong to the disciplines of medicine and computer science, but there are also articles from disciplines such as environmental science, engineering, sociology, education, economics, and management.

This paper aims to continue and add to the recently initiated discussion on the use of large language models like ChatGPT in the creation of scholarly works.
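The detection step the paper describes amounts to scanning for boilerplate phrases that betray pasted chatbot output. In the sketch below, the first phrase comes from the paper's title; the remaining phrases and the helper function are illustrative assumptions rather than the study's published search strings.

```python
# Scan a text for boilerplate phrases that typically betray undeclared
# ChatGPT output. Only the first phrase is taken from the paper's title;
# the others are illustrative additions.
import re

TELLTALE_PHRASES = [
    "as of my last knowledge update",
    "as an ai language model",
    "i cannot fulfill this request",
    "regenerate response",
]

def find_llm_fragments(text: str) -> list[tuple[str, int]]:
    """Return (phrase, character offset) for every telltale match."""
    lowered = text.lower()
    return [(phrase, m.start())
            for phrase in TELLTALE_PHRASES
            for m in re.finditer(re.escape(phrase), lowered)]

sample = ("As of my last knowledge update in September 2021, no such "
          "dataset had been released.")
print(find_llm_fragments(sample))
# -> [('as of my last knowledge update', 0)]
```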


DOI : https://doi.org/10.1002/leap.1650

Evaluating Research Quality with Large Language Models: An Analysis of ChatGPT’s Effectiveness with Different Settings and Inputs

Author : Mike Thelwall

Evaluating the quality of academic journal articles is a time-consuming but critical task for national research evaluation exercises, appointments, and promotions. It is therefore important to investigate whether Large Language Models (LLMs) can play a role in this process.

This article assesses which ChatGPT inputs (full text without tables, figures and references; title and abstract; title only) produce better quality score estimates, and the extent to which scores are affected by ChatGPT models and system prompts.

The results show that the optimal input is the article title and abstract, with average ChatGPT scores based on these (30 iterations on a dataset of 51 papers) correlating at 0.67 with human scores, the highest correlation yet reported. ChatGPT 4o is slightly better than 3.5-turbo (0.66) and 4o-mini (0.66).

The results suggest that article full texts might confuse LLM research quality evaluations, even though complex system instructions for the task are more effective than simple ones.

Thus, whilst abstracts contain insufficient information for a thorough assessment of rigour, they may contain strong pointers about originality and significance. Finally, linear regression can be used to convert the model scores into the human scale scores, which is 31% more accurate than guessing.
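The final calibration step the abstract mentions is ordinary least squares. The sketch below shows the rescaling idea with scikit-learn; the score arrays are fabricated placeholders, not the paper's data.

```python
# Rescale averaged ChatGPT scores onto the human quality scale with
# linear regression, as the abstract describes. The arrays are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

chatgpt_means = np.array([[3.2], [4.1], [2.8], [3.9], [3.5]])  # mean of 30 runs
human_scores = np.array([2.0, 4.0, 1.0, 3.0, 3.0])             # human-scale scores

model = LinearRegression().fit(chatgpt_means, human_scores)
print("calibrated:", model.predict(chatgpt_means).round(2))
# The paper reports this calibration is 31% more accurate than guessing.
```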

arXiv : https://arxiv.org/abs/2408.06752

Global insights: ChatGPT’s influence on academic and research writing, creativity, and plagiarism policies

Authors : Muhammad Abid Malik, Amjad Islam Amjad, Sarfraz Aslam, Abdulnaser Fakhrou

Introduction: The current study explored the influence of Chat Generative Pre-Trained Transformer (ChatGPT) on the concepts, parameters, policies, and practices of creativity and plagiarism in academic and research writing.

Methods: Data were collected from 10 researchers from 10 different countries (Australia, China, the UK, Brazil, Pakistan, Bangladesh, Iran, Nigeria, Trinidad and Tobago, and Turkiye) using semi-structured interviews. NVivo was employed for data analysis.

Results: Based on the responses, five themes about the influence of ChatGPT on academic and research writing were generated, i.e., opportunity, human assistance, thought-provoking, time-saving, and negative attitude. Although the researchers were mostly positive about it, some feared it would degrade their writing skills and lead to plagiarism. Many of them believed that ChatGPT would redefine the concepts, parameters, and practices of creativity and plagiarism.

Discussion: Creativity may no longer be restricted to the ability to write, but may extend to the ability to use ChatGPT or other large language models (LLMs) to write creatively. Some suggested that machine-generated text might be accepted as the new norm; however, using it without proper acknowledgment would be considered plagiarism. The researchers recommended allowing ChatGPT for academic and research writing, but strongly advised that its use be regulated, limited, and properly acknowledged.


DOI : https://doi.org/10.3389/frma.2024.1486832

The use of ChatGPT for identifying disruptive papers in science: a first exploration

Authors : Lutz Bornmann, Lingfei Wu, Christoph Ettl

ChatGPT has arrived in quantitative research evaluation. With the exploration in this Letter to the Editor, we would like to widen the spectrum of possible uses of ChatGPT in bibliometrics by applying it to the identification of disruptive papers.

The identification of disruptive papers using publication and citation counts has become a popular topic in scientometrics. The disadvantage of the quantitative approach is its computational complexity. ChatGPT might be an easy-to-use alternative.
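The computational burden referred to above comes from the disruption (CD) index, which requires the full citation neighbourhood of each focal paper. Below is a minimal sketch of the standard formula under the usual definitions; the function and the toy citation data are illustrative, not the authors' code.

```python
# Minimal sketch of the CD (disruption) index: citers of the focal paper
# that ignore its references push the score toward +1 (disruptive);
# citers that also cite the references push it toward -1 (consolidating).
def cd_index(focal: str, references: set[str],
             citations: dict[str, set[str]]) -> float:
    """citations maps each citing paper to the set of papers it cites."""
    n_i = n_j = n_k = 0
    for cited in citations.values():
        cites_focal = focal in cited
        cites_refs = bool(references & cited)
        if cites_focal and not cites_refs:
            n_i += 1  # cites the focal paper only: disruptive signal
        elif cites_focal and cites_refs:
            n_j += 1  # cites focal paper and its ancestors: consolidating
        elif cites_refs:
            n_k += 1  # cites the ancestors only
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else 0.0

# Toy citation network: two papers cite only the focal paper "F",
# one cites "F" plus a reference, one cites a reference only.
cites = {"p1": {"F"}, "p2": {"F"}, "p3": {"F", "r1"}, "p4": {"r2"}}
print(cd_index("F", {"r1", "r2"}, cites))  # (2 - 1) / (2 + 1 + 1) = 0.25
```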


DOI : https://doi.org/10.1007/s11192-024-05176-z

Academic writing in the age of AI: Comparing the reliability of ChatGPT and Bard with Scopus and Web of Science

Authors : Swati Garg, Asad Ahmad, Dag Øivind Madsen

ChatGPT and Bard (now known as Gemini) are becoming indispensable resources for researchers, academicians and diverse stakeholders within the academic landscape. At the same time, traditional digital tools such as scholarly databases continue to be widely used. Web of Science and Scopus are the most extensive academic databases and are generally regarded as consistently reliable scholarly research resources. With the increasing acceptance of artificial intelligence (AI) in academic writing, this study focuses on understanding the reliability of the new AI models compared to Scopus and Web of Science.

The study includes a bibliometric analysis of green, sustainable and ecological buying behaviour, covering the period from 1 January 2011 to 21 May 2023. Its results are used to compare the outputs of the AI models with those of the traditional scholarly databases on several parameters. Overall, the findings suggest that AI models like ChatGPT and Bard are not yet reliable for academic writing tasks. It appears to be too early to depend on AI for such tasks.


DOI : https://doi.org/10.1016/j.jik.2024.100563