Text mining arXiv: a look through quantitative finance papers

Author : Michele Leonardo Bianchi

This paper explores articles hosted on the arXiv preprint server with the aim of uncovering valuable insights hidden in this vast collection of research. Using text mining and natural language processing methods, we examine the contents of quantitative finance papers posted on arXiv from 1997 to 2022.

We extract and analyze crucial information from the full documents, including the references, to understand topic trends over time and to identify the most cited researchers and journals in this domain. Additionally, we compare numerous algorithms for topic modeling, including state-of-the-art approaches.

URL : https://arxiv.org/abs/2401.01751

ODDPub – a Text-Mining Algorithm to Detect Data Sharing in Biomedical Publications

Authors: Nico Riedel, Miriam Kip, Evgeny Bobro

Open research data are increasingly recognized as a quality indicator and an important resource to increase transparency, robustness and collaboration in science. However, no standardized way of reporting Open Data in publications exists, making it difficult to find shared datasets and assess the prevalence of Open Data in an automated fashion.

We developed ODDPub (Open Data Detection in Publications), a text-mining algorithm that screens biomedical publications and detects cases of Open Data. Using English-language original research publications from a single biomedical research institution (n = 8689) and publications randomly selected from PubMed (n = 1500), we iteratively developed a set of derived keyword categories.

ODDPub can detect data sharing through field-specific repositories, general-purpose repositories or the supplement. Additionally, it can detect shared analysis code (Open Code).
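The abstract does not list the keyword categories themselves. As a rough illustration of how keyword-based screening of this kind can work, here is a minimal Python sketch (ODDPub itself is an R package). The category names, the regular expressions, and the `screen` and `summarize` helpers are assumptions made up for this example; they are not ODDPub's actual rules.

```python
import re

# Hypothetical keyword categories. These patterns are illustrative
# assumptions, not the keyword lists used by ODDPub.
CATEGORIES = {
    "field_specific_repository": re.compile(
        r"\b(?:gene expression omnibus|arrayexpress|protein data bank)\b",
        re.IGNORECASE,
    ),
    "general_repository": re.compile(
        r"\b(?:figshare|zenodo|dryad|open science framework)\b",
        re.IGNORECASE,
    ),
    "supplement": re.compile(
        r"\bdata(?:sets?)?\b[^.]{0,80}\bsupplement(?:ary|al)?\b",
        re.IGNORECASE,
    ),
    "open_code": re.compile(
        r"\b(?:source code|analysis code|scripts?)\b[^.]{0,80}"
        r"\b(?:github|gitlab|bitbucket)\b",
        re.IGNORECASE,
    ),
}

# Categories that count as Open Data (as opposed to Open Code).
OPEN_DATA_CATEGORIES = (
    "field_specific_repository",
    "general_repository",
    "supplement",
)


def screen(text):
    """Return, per category, the sentences that match its pattern."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return {
        name: [s for s in sentences if pattern.search(s)]
        for name, pattern in CATEGORIES.items()
    }


def summarize(hits):
    """Collapse per-category hits into Open Data / Open Code flags."""
    return {
        "open_data": any(hits[name] for name in OPEN_DATA_CATEGORIES),
        "open_code": bool(hits["open_code"]),
    }


if __name__ == "__main__":
    sample = (
        "Sequencing data were deposited in the Gene Expression Omnibus. "
        "Analysis code is available on GitHub. We thank our colleagues."
    )
    print(summarize(screen(sample)))
```

Keeping the matched sentences, rather than only a yes/no flag, makes it easier to check detections by hand, which is what a manual validation step needs.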

To validate ODDPub, we manually screened 792 publications randomly selected from PubMed. On this validation dataset, our algorithm detected Open Data publications with a sensitivity of 0.73 and specificity of 0.97.

Open Data was detected for 11.5% (n = 91) of publications. Open Code was detected for 1.4% (n = 11) of publications with a sensitivity of 0.73 and specificity of 1.00. We compared our results to the linked datasets found in the databases PubMed and Web of Science.
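For readers less familiar with these metrics, the sketch below shows how sensitivity and specificity are computed from manual labels and algorithm output, and checks the reported detection rates against the validation sample size (91 and 11 out of 792). The label lists are hypothetical toy data, not the ODDPub validation set, and the function names are assumptions for this example.

```python
def sensitivity_specificity(predicted, actual):
    """Compute sensitivity and specificity from paired boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


def detection_rate(detected, total):
    """Share of screened publications flagged, as a percentage."""
    return 100 * detected / total


# Hypothetical toy labels (NOT the ODDPub validation data):
# 10 publications truly share data, 90 do not.
actual = [True] * 10 + [False] * 90
predicted = [True] * 8 + [False] * 2 + [True] * 3 + [False] * 87

sens, spec = sensitivity_specificity(predicted, actual)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")

# Detection rates quoted in the abstract: 91 and 11 of 792 publications.
print(round(detection_rate(91, 792), 1))  # 11.5
print(round(detection_rate(11, 792), 1))  # 1.4
```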

Our algorithm can automatically screen large numbers of publications for Open Data. It can thus be used to assess Open Data sharing rates on the level of subject areas, journals, or institutions. It can also identify individual Open Data publications in a larger publication corpus. ODDPub is published as an R package on GitHub.

DOI : http://doi.org/10.5334/dsj-2020-042

Text Data Mining from the Author’s Perspective: Whose Text, Whose Mining, and to Whose Benefit?

Author : Christine L. Borgman

Given the many technical, social, and policy shifts in access to scholarly content since the early days of text data mining, it is time to expand the conversation about text data mining from the concerns of the researcher wishing to mine data to include the concerns of researcher-authors about how their data are mined, by whom, for what purposes, and to whose benefit.

URL : https://arxiv.org/abs/1803.04552

Le temps des SIC (Time in the Information and Communication Sciences)

Authors : Gabriel Gallezot, Emmanuel Marty

To give an account of time in the Information and Communication Sciences (Sciences de l'Information et de la Communication, SIC), we chose to analyze the vocabulary of researchers. Our study draws on texts freely deposited by their authors on the HAL/@sic platform.

The text mining is carried out through a series of lexicometric analyses with two objectives: first, to grasp the notions related to time in SIC research; second, to account for how research fields and questions in SIC have evolved over time.
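The abstract does not specify which lexicometric measures were used. One common starting point for tracking vocabulary over time is the relative frequency of chosen terms per year, sketched below in Python. The toy corpus, the chosen terms, and the `relative_frequencies` helper are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter, defaultdict


def relative_frequencies(documents, terms):
    """Per year, occurrences of each term divided by all tokens that year.

    `documents` is an iterable of (year, text) pairs.
    """
    counts = defaultdict(Counter)  # year -> token counts
    for year, text in documents:
        counts[year].update(re.findall(r"\w+", text.lower()))

    result = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        result[year] = {
            term: (counts[year][term.lower()] / total if total else 0.0)
            for term in terms
        }
    return result


if __name__ == "__main__":
    # Hypothetical toy corpus, not taken from HAL/@sic.
    corpus = [
        (2005, "Time and media. Time matters."),
        (2015, "Media, networks and time."),
    ]
    for year, freqs in relative_frequencies(corpus, ["time", "media"]).items():
        print(year, freqs)
```

Dividing by the total token count per year keeps years with many deposited texts from dominating the comparison.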

URL : https://archivesic.ccsd.cnrs.fr/sic_01599944

Bibliometric methods for detecting and analysing emerging research topics

This study gives an overview of the process of clustering scientific disciplines using hybrid methods, detecting and labelling emerging topics and analysing the results using bibliometric methods.

The hybrid clustering techniques are based on bibliographic coupling, text mining and ‘core documents’, and cross-citation links are used to identify emerging fields.
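Bibliographic coupling links two documents through the references they share. The sketch below computes the raw coupling strength (number of shared references) and a cosine-normalised variant for a toy set of papers. The document and reference identifiers and the function names are hypothetical; the study's hybrid method also combines this citation-based signal with text-based similarity, which is not shown here.

```python
from itertools import combinations
from math import sqrt


def coupling_strengths(references):
    """Map each coupled document pair to the number of shared references.

    `references` maps a document ID to the set of IDs it cites.
    Pairs with no shared reference are omitted.
    """
    strengths = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared:
            strengths[(a, b)] = shared
    return strengths


def cosine_coupling(references, a, b):
    """Shared references normalised by the sizes of both reference lists."""
    refs_a, refs_b = references[a], references[b]
    if not refs_a or not refs_b:
        return 0.0
    return len(refs_a & refs_b) / sqrt(len(refs_a) * len(refs_b))


if __name__ == "__main__":
    # Hypothetical toy data: three papers and the references they cite.
    refs = {
        "A": {"r1", "r2", "r3"},
        "B": {"r2", "r3", "r4"},
        "C": {"r5"},
    }
    print(coupling_strengths(refs))
    print(round(cosine_coupling(refs, "A", "B"), 3))
```

The normalised variant avoids favouring papers with very long reference lists, which would otherwise couple with almost everything.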

The collaboration network of those countries that proved to be most active in the underlying disciplines, in combination with a set of standard indicators, forms the groundwork for the bibliometric analysis of the detected emerging research topics.

URL : http://hdl.handle.net/10760/16947

Value and benefits of text mining

“Vast amounts of new information and data are generated every day through economic, academic and social activities. This sea of data, predicted to increase at a rate of 40% p.a., has significant potential economic and societal value. Techniques such as text and data mining and analytics are required to exploit this potential.

Businesses use such techniques to analyse customer and competitor data to improve competitiveness; the pharmaceutical industry mines patents and research articles to improve drug discovery; within academic research, mining and analytics of large datasets are delivering efficiencies and new knowledge in areas as diverse as biological science, particle physics and media and communications.

We have explored the costs, benefits, barriers and risks associated with text mining within UKFHE research using the approach to welfare economics laid out in the UK Treasury best practice guidelines for evaluation.”

URL : http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx