Data Science at the Singularity

Author : David Donoho

Something fundamental to computation-based research has changed in the last ten years: in certain fields, progress is dramatically more rapid than before. Researchers in affected fields are living through a period of profound transformation as their fields undergo a transition to frictionless reproducibility (FR).

This transition markedly changes the rate of spread of ideas and practices, affects scientific mindsets and the goals of science, and erases memories of much that came before. The emergence of FR flows from three data science principles that matured together after decades of work by many technologists and numerous research communities.

The mature principles involve data sharing, code sharing, and competitive challenges, implemented in the particularly strong form of frictionless open services. Empirical Machine Learning is today’s leading adherent field: its hidden superpower is adherence to frictionless reproducibility practices, which are responsible for the striking and surprising progress in AI that we see everywhere. These practices can be learned and adhered to by researchers in any field, automatically increasing the rate of progress in each adherent field.

URL : Data Science at the Singularity

DOI : https://doi.org/10.1162/99608f92.b91339ef

Emerging roles and responsibilities of libraries in support of reproducible research

Authors : Birgit Schmidt, Andrea Chiarelli, Lucia Loffreda, Jeroen Sondervan

Ensuring the reproducibility of research is a multi-stakeholder effort that presents challenges and opportunities for individual researchers and research communities, librarians, publishers, funders, and service providers. These emerge at various steps of the research process and, in particular, at the publication stage.

Previous work by Knowledge Exchange highlighted that, while there is growing awareness among researchers, reproducible publication practices have been slow to change. Importantly, research reproducibility has not yet reached institutional agendas: this work seeks to highlight the rationale for libraries to initiate and/or step up their engagement with this topic, which we argue is well aligned with their core values and strategic priorities.

We draw on secondary analysis of data gathered by Knowledge Exchange, focusing on the literature identified as well as interviews held with librarians. We extend this through further investigation of the literature and by integrating the findings of discussions held at the 2022 LIBER conference, to provide an updated picture of how libraries engage with research reproducibility.

Libraries have a significant role in promoting responsible research practices, including transparency and reproducibility, by leveraging their connections to academic communities and collaborating with stakeholders like research funders and publishers. Our recommendations for libraries include: i) partnering with researchers to promote a research culture that values transparency and reproducibility; ii) enhancing existing research infrastructure and support; and iii) investing in raising awareness and developing skills and capacities related to these principles.

URL : Emerging roles and responsibilities of libraries in support of reproducible research

DOI : https://doi.org/10.53377/lq.14947

Reproducibility in Management Science

Authors : Miloš Fišar, Ben Greiner, Christoph Huber, Elena Katok, Ali I. Ozkes

With the help of more than 700 reviewers, we assess the reproducibility of nearly 500 articles published in the journal Management Science before and after the introduction of a new Data and Code Disclosure policy in 2019.

When considering only articles for which data accessibility and hardware and software requirements were not an obstacle for reviewers, the results of more than 95% of articles under the new disclosure policy could be fully or largely computationally reproduced. However, for 29% of articles, at least part of the data set was not accessible to the reviewer. Considering all articles in our sample reduces the share of reproduced articles to 68%.

These figures represent a significant increase compared with the period before the introduction of the disclosure policy, when only 12% of articles voluntarily provided replication materials, of which 55% could be (largely) reproduced. Substantial heterogeneity in reproducibility rates across different fields is mainly driven by differences in data set accessibility.

Other reasons for unsuccessful reproduction attempts include missing code, unresolvable code errors, weak or missing documentation, demanding software and hardware requirements, and code complexity. Our findings highlight the importance of journal code and data disclosure policies and suggest potential avenues for enhancing their effectiveness.

DOI : https://doi.org/10.1287/mnsc.2023.03556

Reproducible research practices and transparency across linguistics

Authors : Agata Bochynska, Liam Keeble, Caitlin Halfacre, Joseph V. Casillas, Irys-Amélie Champagne, Kaidi Chen, Melanie Röthlisberger, Erin M. Buchanan, Timo B. Roettger

Scientific studies of language span many disciplines and provide evidence for social, cultural, cognitive, technological, and biomedical studies of human nature and behavior. As linguistics becomes increasingly empirical and quantitative, it has been facing challenges and limitations of scientific practices that pose barriers to reproducibility and replicability.

One of the proposed solutions to the widely acknowledged reproducibility and replicability crisis has been the implementation of transparency practices, e.g., open access publishing, preregistrations, sharing study materials, data, and analyses, performing study replications, and declaring conflicts of interest.

Here, we have assessed the prevalence of these practices in 600 randomly sampled journal articles from linguistics across two time points. In line with similar studies in other disciplines, we found that 35% of the articles were published open access and the rates of sharing materials, data, and protocols were below 10%. None of the articles reported preregistrations, 1% reported replications, and 10% had conflict of interest statements.

These rates have not increased noticeably between 2008/2009 and 2018/2019, pointing to remaining barriers and the slow adoption of open and reproducible research practices in linguistics.

To facilitate adoption of these practices, we provide a range of recommendations and solutions for implementing transparency and improving reproducibility of research in linguistics.

URL : Reproducible research practices and transparency across linguistics

DOI : https://doi.org/10.5070/G6011239

Analytical code sharing practices in biomedical research

Authors : Nitesh Kumar Sharma, Ram Ayyala, Dhrithi Deshpande et al.

Data-driven computational analysis is becoming increasingly important in biomedical research as the amount of data being generated continues to grow. However, the failure to share research outputs such as data, source code, and methods undermines the transparency and reproducibility of studies, which are critical to the advancement of science. Many published studies are not reproducible because the documentation, code, and data shared are insufficient.

We conducted a comprehensive analysis of 453 manuscripts published between 2016 and 2021 and found that 50.1% of them fail to share the analytical code. Even among those that did disclose their code, the vast majority failed to offer additional research outputs, such as data. Furthermore, only one in ten papers organized their code in a structured and reproducible manner. We discovered a significant association between the presence of code availability statements and increased code availability (p = 2.71 × 10⁻⁹).

Additionally, a greater proportion of studies conducting secondary analyses shared their code compared with those conducting primary analyses (p = 1.15 × 10⁻⁷). In light of our findings, we propose raising awareness of code sharing practices and taking immediate steps to enhance code availability to improve reproducibility in biomedical research.

By increasing transparency and reproducibility, we can promote scientific rigor, encourage collaboration, and accelerate scientific discoveries. We must prioritize open science practices, including sharing code, data, and other research products, to ensure that biomedical research can be replicated and built upon by others in the scientific community.
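
By way of illustration (our sketch, not the authors’ evaluation rubric), a repository organized in a “structured and reproducible manner” typically separates code, data, and documentation and pins its dependencies:

project/
├── README.md          (how to run the analysis and what it produces)
├── environment.yml    (pinned software dependencies)
├── data/              (raw and processed data, or download scripts)
├── src/               (analysis code, one script per pipeline step)
├── results/           (generated figures and tables)
└── LICENSE            (terms of reuse)

A layout like this, referenced from a code availability statement in the paper, addresses both gaps the study identifies: code that is missing entirely and code that is shared but unstructured.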

URL : Analytical code sharing practices in biomedical research

DOI : https://doi.org/10.1101/2023.07.31.551384

Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer

Authors : Moritz Schubotz, Ankit Satpute, André Greiner-Petter, Akiko Aizawa, Bela Gipp

Small to medium-scale data science experiments often rely on research software developed ad hoc by individual scientists or small teams, and there is often no time to make that software fast, reusable, and openly accessible.

The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework; in the worst case, others cannot reproduce the experiment and reuse the findings for subsequent research. Second, if the ad hoc research software fails during long-running, computationally expensive experiments, the overall effort to iteratively improve the software and rerun the experiments creates significant time pressure on the researchers.

We suggest making caching an integral part of the research software development process, even before the first line of code is written.

This article outlines caching recommendations for developing research software in data science projects. Our recommendations offer a way to circumvent common problems such as dependence on proprietary tools and slow runtimes, while caching also contributes to the reproducibility of experiments in the open science workflow.

Concerning the four guiding principles of Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that including the proposed recommendations in research software development will make the data related to that software FAIRer for both machines and humans.

We demonstrate the usefulness of some of the proposed recommendations on our recently completed research software project in mathematical information retrieval.
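
To make the idea concrete, here is a minimal sketch of disk-backed caching for an expensive experiment step. It assumes Python with the joblib library; the article itself does not prescribe a particular tool, so treat this as one possible realization of its recommendation:

import random
from joblib import Memory

# Results are stored on disk, so an interrupted or crashed experiment
# can be rerun without recomputing steps that already completed.
memory = Memory("./experiment_cache", verbose=0)

@memory.cache
def expensive_step(param: float, seed: int) -> float:
    # Stand-in for a long-running computation (simulation, model fit, ...).
    rng = random.Random(seed)
    return sum(rng.random() * param for _ in range(10_000_000))

result = expensive_step(0.5, seed=42)  # computed once, then loaded from the cache

Because cached results persist across runs, a failure late in a pipeline costs only the failed step, which is exactly the time pressure the abstract describes.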

URL : Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer

DOI : https://doi.org/10.3389/frma.2022.861944

RipetaScore: Measuring the Quality, Transparency, and Trustworthiness of a Scientific Work

Authors : Josh Q. Sumner, Cynthia Hudson Vitale, Leslie D. McIntosh

A wide array of existing metrics quantifies a scientific paper’s prominence or its author’s prestige. Many who use these metrics assume that higher citation counts or more public attention must indicate more reliable, better-quality science.

While current metrics offer valuable insight into scientific publications, they are an inadequate proxy for measuring the quality, transparency, and trustworthiness of published research.

Establishing trust in a work rests on three essential elements: trust in the paper, trust in the author, and trust in the data. To address these elements in a systematic and automated way, we propose the ripetaScore as a direct measurement of a paper’s research practices, professionalism, and reproducibility.

Using a sample of our current corpus of academic papers, we demonstrate the ripetaScore’s efficacy in determining the quality, transparency, and trustworthiness of an academic work.

In this paper, we aim to provide a metric to evaluate scientific reporting quality in terms of transparency and trustworthiness of the research, professionalism, and reproducibility.

URL : RipetaScore: Measuring the Quality, Transparency, and Trustworthiness of a Scientific Work

DOI : https://doi.org/10.3389/frma.2021.751734