Text mining arXiv: a look through quantitative finance papers

Author : Michele Leonardo Bianchi

This paper explores articles hosted on the arXiv preprint server with the aim of uncovering valuable insights hidden in this vast collection of research. Using text mining techniques and natural language processing methods, we examine the contents of quantitative finance papers posted on arXiv from 1997 to 2022.

We extract and analyze crucial information from the entire documents, including the references, to understand topic trends over time and to identify the most cited researchers and journals in this domain. Additionally, we compare numerous algorithms to perform topic modeling, including state-of-the-art approaches.

URL : https://arxiv.org/abs/2401.01751

A framework for improving the accessibility of research papers on arXiv.org

Authors : Shamsi Brinn, Christopher Cameron, David Fielding, Charles Frankston, Alison Fromme, Peter Huang, Mark Nazzaro, Stephanie Orphan, Steinn Sigurdsson, Ryan Tay, Miranda Yang, Qianyu Zhou

The research content hosted by arXiv is not fully accessible to everyone due to disabilities and other barriers. This matters because a significant proportion of people have reading and visual disabilities, it is important to our community that arXiv is as open as possible, and if science is to advance, we need wide and diverse participation.

In addition, we have mandates to become accessible, and accessible content benefits everyone. In this paper, we will describe the accessibility problems with research, review current mitigations (and explain why they aren’t sufficient), and share the results of our user research with scientists and accessibility experts.

Finally, we will present arXiv’s proposed next step towards more open science: offering HTML alongside existing PDF and TeX formats. An accessible HTML version of this paper is also available at https://info.arxiv.org/about/accessibility_research_report.html

URL : https://arxiv.org/abs/2212.07286

Reproducibility of COVID-19 pre-prints

Authors : Annie Collins, Rohan Alexander

To examine the reproducibility of COVID-19 research, we create a dataset of pre-prints posted to arXiv, bioRxiv, medRxiv, and SocArXiv between 28 January 2020 and 30 June 2021 that are related to COVID-19.

We extract the text from these pre-prints and parse them looking for keyword markers signalling the availability of the data and code underpinning the pre-print. For the pre-prints that are in our sample, we are unable to find markers of either open data or open code for 75 per cent of those on arXiv, 67 per cent of those on bioRxiv, 79 per cent of those on medRxiv, and 85 per cent of those on SocArXiv.

We conclude that there may be value in having authors categorize the degree of openness of their pre-print as part of the pre-print submissions process, and more broadly, there is a need to better integrate open science training into a wide range of fields.

URL : https://arxiv.org/abs/2107.10724
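
The keyword-marker search described in the abstract above can be illustrated with a minimal Python sketch. The marker lists and the function names (`find_markers`, `classify`, `share_without_markers`) are assumptions made for illustration only; the abstract does not state which keywords or tooling the authors actually used, so this should not be read as their method.

```python
# Hypothetical marker lists: the abstract does not give the authors' keywords.
DATA_MARKERS = ["data are available", "data is available", "zenodo.org", "osf.io"]
CODE_MARKERS = ["code is available", "github.com", "gitlab.com"]


def find_markers(text, markers):
    """Return the markers that occur in the text, ignoring case."""
    lowered = text.lower()
    return [marker for marker in markers if marker in lowered]


def classify(text):
    """Flag whether a pre-print's text contains open-data or open-code markers."""
    return {
        "open_data": bool(find_markers(text, DATA_MARKERS)),
        "open_code": bool(find_markers(text, CODE_MARKERS)),
    }


def share_without_markers(texts):
    """Fraction of texts with neither an open-data nor an open-code marker."""
    if not texts:
        return 0.0
    missing = sum(1 for text in texts if not any(classify(text).values()))
    return missing / len(texts)
```

Plain substring matching of this kind is crude: a phrase such as "data is available upon request" would still count as a marker, so a real study would likely need more careful rules or manual checks.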

Preprint Abstracts in Times of Crisis: a Comparative Study with the Pre-pandemic Period

Authors : Frédérique Bordignon, Liana Ermakova, Marianne Noel

The urgency to respond to the COVID-19 outbreak has driven an unprecedented surge in preprints that aim to speed up knowledge dissemination as they are available much sooner than peer-reviewed publications.

In this study we consider abstracts of research articles and preprints as main entry points that draw attention to the most important information of the document and that try to entice us to read the whole article. In this paper, we try to capture and examine shifts in scientific abstract writing produced at the very beginning of the pandemic.

We compare the informativeness of abstracts of preprints issued in response to the COVID-19 pandemic with that of abstracts produced in 2019, the closest pre-pandemic period. Our results clearly differ from one preprint server to another and show that there are community-centered habits as regards writing and reporting results.

The preprints issued from the arXiv, ChemRxiv and Research Square servers tend to have more informative (generous) abstracts than the ones submitted to the other servers. In four servers, the ratio of structured abstracts decreases with the pandemic.

Original location : https://hal-enpc.archives-ouvertes.fr/hal-03187900

Is preprint the future of science? A thirty year journey of online preprint services

Authors : Boya Xie, Zhihong Shen, Kuansan Wang

A preprint is a version of a scientific paper that is publicly distributed before formal peer review. Since the launch of arXiv in 1991, preprints have been increasingly distributed over the Internet as opposed to paper copies.

Online distribution allows original research to be disseminated openly within a few days, often at a very low operating cost. This work overviews how preprints have evolved and affected the research community over the past thirty years alongside the growth of the Web.

In this work, we first report that the number of preprints has grown exponentially, increasing 63-fold in 30 years, although preprints still account for only 4% of research articles. Second, we quantify the benefits that preprints bring to authors: preprints reach an audience 14 months earlier on average and are associated with five times more citations compared with a non-preprint counterpart. Last, to address the quality concern of preprints, we discover that 41% of preprints are ultimately published at a peer-reviewed destination, and the venues where they are published are as influential as those of papers without a preprint version.

Additionally, we discuss the unprecedented role of preprints in communicating the latest research data during recent public health emergencies. In conclusion, we provide quantitative evidence to unveil the positive impact of preprints on individual researchers and the community.

Preprints make scholarly communication more efficient by disseminating scientific discoveries more rapidly and widely with the aid of Web technologies. The measurements we present in this study can help researchers and policymakers make informed decisions about how to effectively use and responsibly embrace a preprint culture.

URL : https://arxiv.org/abs/2102.09066

Being published successfully or getting arXived? The importance of social capital and interdisciplinary collaboration for getting printed in a high impact journal in Physics

Authors : Oliver J. Wieczorek, Mark Wittek, Raphael H. Heiberger

The structure of collaboration is known to be of great importance for the success of scientific endeavors. In particular, various types of social capital employed in co-authored work and projects bridging disciplinary boundaries have attracted researchers’ interest.

Almost all previous studies, however, use samples with an inherent survivor bias, i.e., they focus on papers that have already been published. In contrast, our article examines the chances for getting a working paper published by using a unique dataset of 245,000 papers uploaded to arXiv.

ArXiv is a popular preprint platform in Physics which allows us to construct a co-authorship network from which we can derive different types of social capital and interdisciplinary teamwork.

To emphasize the ‘normal case’ of community-specific standards of excellence, we assess publications in Physics’ high impact journals as success. Utilizing multilevel event history models, our results reveal that even a moderate number of persistent collaborations spanning at least two years is the most important social antecedent of getting a manuscript published successfully.

In contrast, inter- and subdisciplinary collaborations decrease the probability of publishing in an eminent journal in Physics, which can only partially be mitigated by scientists’ social capital.

URL : https://arxiv.org/abs/2006.02148

arXiv and the Symbiosis of Physics Preprints and Journal Review Articles: A Model

Author : Brian Simboli

This paper recommends a publishing model that can help achieve the goal of reforming physics publishing. It distinguishes two complementary needs in scholarly communication.

Preprints, increasingly important in science, are properly the vehicle for claiming priority of discovery and for eliciting feedback that will help with versioning.

Traditional journal publishing, however, should focus on providing synthesis in the form of overlay journals that play the same role as review articles.

URL : https://arxiv.org/abs/1904.01470