PreprintMatch: A tool for preprint to publication detection shows global inequities in scientific publication

Authors : Peter Eckmann, Anita Bandrowski

Preprints, versions of scientific manuscripts that precede peer review, are growing in popularity. They offer an opportunity to democratize and accelerate research, as they involve neither publication costs nor a lengthy peer review process. Preprints are often later published in peer-reviewed venues, but these publications and the original preprints are frequently not linked in any way.

To this end, we developed a tool, PreprintMatch, to find matches between preprints and their corresponding published papers, if they exist. This tool outperforms existing techniques for matching preprints and papers, in both matching performance and speed. PreprintMatch was applied to search for matches between preprints (from bioRxiv and medRxiv) and papers in PubMed.
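As a rough illustration of what matching a preprint to a published paper can involve, the sketch below compares titles with Python's standard `difflib`. This is not PreprintMatch's actual method, which the abstract does not describe in detail; the function names and the 0.8 threshold are assumptions made for this example only.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and repeated whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(preprint_title, paper_titles, threshold=0.8):
    """Return (title, score) for the most similar paper title.

    Returns None when there are no candidates or the best score is below
    the threshold (0.8 is an arbitrary, illustrative cut-off).
    """
    if not paper_titles:
        return None
    scored = [(t, title_similarity(preprint_title, t)) for t in paper_titles]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None
```

A real matcher would likely also compare abstracts and author lists, since the abstract above reports that all three can change between preprint and publication.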

The preliminary nature of preprints offers a unique perspective into scientific projects at a relatively early stage, and with better matching between preprint and paper, we explored questions related to research inequity.

We found that preprints from low-income countries are published as peer-reviewed papers at a lower rate than those from high-income countries (39.6% and 61.1%, respectively), and our data are consistent with previous work that cites a lack of resources, lack of stability, and policy choices to explain this discrepancy.

Preprints from low-income countries were also found to be published more quickly (178 vs 203 days) and with less title, abstract, and author similarity to the published version than preprints from high-income countries. Low-income countries add more authors from the preprint to the published version than high-income countries (0.42 authors vs 0.32, respectively), a practice that is significantly more frequent in China compared to similar countries.

Finally, we find that some publishers publish work with authors from lower income countries more frequently than others.

URL : PreprintMatch: A tool for preprint to publication detection shows global inequities in scientific publication


Forecasting the publication and citation outcomes of COVID-19 preprints

Authors : Michael Gordon, Michael Bishop, Yiling Chen, Anna Dreber, Brandon Goldfedder, Felix Holzmeister, Magnus Johannesson, Yang Liu, Louisa Tran, Charles Twardy, Juntao Wang, Thomas Pfeiffer

Many publications on COVID-19 were released on preprint servers such as medRxiv and bioRxiv. It is unknown how reliable these preprints are, and which ones will eventually be published in scientific journals.

In this study, we use crowdsourced human forecasts to predict publication outcomes and future citation counts for a sample of 400 preprints with high Altmetric scores. Most of these preprints (70%) were published within 1 year of upload on a preprint server, with a considerable fraction (45%) appearing in a high-impact journal with a journal impact factor of at least 10.

On average, the preprints received 162 citations within the first year. We found that forecasters can predict if preprints will be published after 1 year and if the publishing journal has high impact. Forecasts are also informative with respect to Google Scholar citations within 1 year of upload on a preprint server.

For both types of assessment, we found statistically significant positive correlations between forecasts and observed outcomes. While the forecasts can help to provide a preliminary assessment of preprints at a faster pace than traditional peer review, it remains to be investigated whether such an assessment is suited to identifying methodological problems in preprints.

URL : Forecasting the publication and citation outcomes of COVID-19 preprints


Publication practices during the COVID-19 pandemic: Expedited publishing or simply an early bird effect?

Authors : Yulia V. Sevryugina, Andrew J. Dicks

This study explores the evolution of publication practices associated with the SARS-CoV-2 research papers, namely, peer-reviewed journal and review articles indexed in PubMed and their associated preprints posted on bioRxiv and medRxiv servers: a total of 4,031 journal article-preprint pairs.

Our assessment of various publication delays during the January 2020 to March 2021 period revealed the early bird effect that lies beyond the involvement of any publisher policy action and is directly linked to the emerging nature of new and ‘hot’ scientific topics.

We found that when the early bird effect and data incompleteness are taken into account, COVID-19 related research papers show only a moderately expedited speed of dissemination as compared with the pre-pandemic era.

Medians for peer-review and production stage delays were 66 and 15 days, respectively, and the entire conversion process from a preprint to its peer-reviewed journal article version took 109.5 days.

The early bird effect produced an ephemeral perception of a global rush in scientific publishing during the early days of the coronavirus pandemic. We emphasize the importance of considering the early bird effect in interpreting publication data collected at the outset of a newly emerging event.

URL : Publication practices during the COVID-19 pandemic: Expedited publishing or simply an early bird effect?


Examining linguistic shifts between preprints and publications

Authors : David N. Nicholson, Vincent Rubinetti, Dongbo Hu, Marvin Thielk, Lawrence E. Hunter, Casey S. Greene

Preprints allow researchers to make their findings available to the scientific community before they have undergone peer review. Studies on preprints within bioRxiv have been largely focused on article metadata and how often these preprints are downloaded, cited, published, and discussed online.

A missing element that has yet to be examined is the language contained within the bioRxiv preprint repository. We sought to compare linguistic features within bioRxiv preprints to those of published biomedical text as a whole, as this is an excellent opportunity to examine how peer review changes these documents.

The most prevalent features that changed appear to be associated with typesetting and mentions of supporting information sections or additional files. In addition to text comparison, we created document embeddings derived from a preprint-trained word2vec model.

We found that these embeddings are able to parse out different scientific approaches and concepts, link unannotated preprint–peer-reviewed article pairs, and identify journals that publish linguistically similar papers to a given preprint.

We also used these embeddings to examine factors associated with the time elapsed between the posting of a first preprint and the appearance of a peer-reviewed publication. We found that preprints with more versions posted and more textual changes took longer to publish.

Lastly, we constructed a web application that allows users to identify which journals and articles are most linguistically similar to a bioRxiv or medRxiv preprint, as well as to observe where the preprint would be positioned within a published article landscape.

URL : Examining linguistic shifts between preprints and publications


Reproducibility of COVID-19 pre-prints

Authors : Annie Collins, Rohan Alexander

To examine the reproducibility of COVID-19 research, we create a dataset of pre-prints posted to arXiv, bioRxiv, medRxiv, and SocArXiv between 28 January 2020 and 30 June 2021 that are related to COVID-19.

We extract the text from these pre-prints and parse them looking for keyword markers signalling the availability of the data and code underpinning the pre-print. For the pre-prints that are in our sample, we are unable to find markers of either open data or open code for 75 per cent of those on arXiv, 67 per cent of those on bioRxiv, 79 per cent of those on medRxiv, and 85 per cent of those on SocArXiv.
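The keyword-marker approach described above can be sketched roughly as follows. The marker lists here are hypothetical and chosen only for illustration; they are not the authors' actual keywords, and a real pipeline would need a more careful list and handling of PDF text extraction.

```python
# Hypothetical marker lists, for illustration only (not the authors' keywords).
DATA_MARKERS = [
    "data availability", "data are available", "open data",
    "zenodo", "dataverse", "osf.io",
]
CODE_MARKERS = [
    "code availability", "code is available", "source code",
    "github.com", "gitlab.com",
]


def find_markers(text, markers):
    """Return the markers that occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [m for m in markers if m in lowered]


def openness(text):
    """Flag whether a pre-print's text contains any open-data or open-code marker."""
    return {
        "open_data": bool(find_markers(text, DATA_MARKERS)),
        "open_code": bool(find_markers(text, CODE_MARKERS)),
    }
```

A pre-print for which both flags are false would count toward the "no markers found" percentages reported above, under these assumed keyword lists.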

We conclude that there may be value in having authors categorize the degree of openness of their pre-print as part of the pre-print submissions process, and more broadly, there is a need to better integrate open science training into a wide range of fields.


The evolving role of preprints in the dissemination of COVID-19 research and their impact on the science communication landscape

Authors : Nicholas Fraser, Liam Brierley, Gautam Dey, Jessica K. Polka, Máté Pálfy, Federico Nanni, Jonathon Alexis Coates

The world continues to face a life-threatening viral pandemic. The virus underlying the Coronavirus Disease 2019 (COVID-19), Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has caused over 98 million confirmed cases and 2.2 million deaths since January 2020.

Although the most recent respiratory viral pandemic swept the globe only a decade ago, the way science operates and responds to current events has experienced a cultural shift in the interim.

The scientific community has responded rapidly to the COVID-19 pandemic, releasing over 125,000 COVID-19–related scientific articles within 10 months of the first confirmed case, of which more than 30,000 were hosted by preprint servers.

We focused our analysis on bioRxiv and medRxiv, 2 growing preprint servers for biomedical research, investigating the attributes of COVID-19 preprints, their access and usage rates, as well as characteristics of their propagation on online platforms.

Our data provide evidence for increased scientific and public engagement with preprints related to COVID-19 (COVID-19 preprints are accessed more, cited more, and shared more on various online platforms than non-COVID-19 preprints), as well as changes in the use of preprints by journalists and policymakers.

We also find evidence for changes in preprinting and publishing behaviour: COVID-19 preprints are shorter and reviewed faster.

Our results highlight the unprecedented role of preprints and preprint servers in the dissemination of COVID-19 science and the impact of the pandemic on the scientific communication landscape.

URL : The evolving role of preprints in the dissemination of COVID-19 research and their impact on the science communication landscape


Publication rate and citation counts for preprints released during the COVID-19 pandemic: the good, the bad and the ugly

Authors : Diego Añazco, Bryan Nicolalde, Isabel Espinosa, Jose Camacho, Mariam Mushtaq, Jimena Gimenez, Enrique Teran


Preprints are preliminary reports that have not been peer-reviewed. In December 2019, a novel coronavirus appeared in China, and since then, scientific production, including preprints, has drastically increased. In this study, we intend to evaluate how often preprints about COVID-19 were published in scholarly journals and cited.


We searched the iSearch COVID-19 portfolio to identify all preprints related to COVID-19 posted on bioRxiv, medRxiv, and Research Square from January 1, 2020, to May 31, 2020. We used a custom-designed program to obtain metadata using the Crossref public API.

After that, we determined the publication rate and made comparisons based on citation counts using non-parametric methods. Also, we compared the publication rate, citation counts, and time interval from posting on a preprint server to publication in a scholarly journal among the three different preprint servers.


Our sample included 5,061 preprints, out of which 288 were published in scholarly journals and 4,773 remained unpublished (publication rate of 5.7%). We found that articles published in scholarly journals had a significantly higher total citation count than unpublished preprints within our sample (p < 0.001), and that preprints that were eventually published had a higher citation count as preprints when compared to unpublished preprints (p < 0.001).
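The reported publication rate follows directly from the counts in the paragraph above. A minimal check, with a helper name (`publication_rate`) that is only illustrative:

```python
def publication_rate(published: int, unpublished: int) -> float:
    """Return the fraction of preprints that were later published."""
    total = published + unpublished
    return published / total


# Counts taken from the abstract: 288 published, 4,773 unpublished.
total = 288 + 4773
rate = publication_rate(288, 4773)
print(f"{total} preprints, publication rate {rate:.1%}")
```

The two counts sum to the stated sample of 5,061, and the ratio rounds to the stated 5.7%.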

We also found that published preprints had a significantly higher citation count after publication in a scholarly journal than they had as preprints (p < 0.001). Our results also show that medRxiv had the highest publication rate, while bioRxiv had the highest citation count and the shortest time interval from posting on a preprint server to publication in a scholarly journal.


We found a remarkably low publication rate for preprints within our sample, despite accelerated time to publication by multiple scholarly journals. These findings could be partially attributed to the unprecedented surge in scientific production observed during the COVID-19 pandemic, which might saturate reviewing and editing processes in scholarly journals.

However, our findings show that preprints had a significantly lower scientific impact, which might suggest that some preprints have lower quality and will not be able to endure peer-reviewing processes to be published in a peer-reviewed journal.

URL : Publication rate and citation counts for preprints released during the COVID-19 pandemic: the good, the bad and the ugly