In the Academy, Data Science Is Lonely: Barriers to Adopting Data Science Methods for Scientific Research

Authors : Gabrielle O’Brien, Jordan Mick

Data science has been heralded as a transformative family of methods for scientific discovery. Despite this excitement, putting these methods into practice in scientific research has proven challenging. We conducted a qualitative interview study of 25 researchers at the University of Michigan, all scientists who currently work outside of data science (in fields such as astronomy, education, chemistry, and political science) and wish to adopt data science methods as part of their research program.

Semi-structured interviews explored the barriers they faced and strategies scientists used to persevere. These scientists quickly identified that they lacked the expertise to confidently implement and interpret new methods.

For most, independent study was unsuccessful, owing to limited time, missing foundational skills, and difficulty navigating the marketplace of educational data science resources. Overwhelmingly, participants reported isolation in their endeavors and a desire for a greater community. Many sought to bootstrap a community on their own, with mixed results.

Based on their narratives, we provide preliminary recommendations for academic departments, training programs, campus-wide data science initiatives, and universities to build supportive communities of practice that cultivate expertise. These community relationships may be key to growing the research capacity of scientific institutions. 

DOI : https://doi.org/10.1162/99608f92.7ca04767

Data Science at the Singularity

Author : David Donoho

Something fundamental to computation-based research has really changed in the last ten years. In certain fields, progress is simply dramatically more rapid than previously. Researchers in affected fields are living through a period of profound transformation, as the fields undergo a transition to frictionless reproducibility (FR).

This transition markedly changes the rate of spread of ideas and practices, affects scientific mindsets and the goals of science, and erases memories of much that came before. The emergence of FR flows from three data science principles that matured together after decades of work by many technologists and numerous research communities.

The mature principles involve data sharing, code sharing, and competitive challenges, however implemented in the particularly strong form of frictionless open services. Empirical Machine Learning is today’s leading adherent field; its hidden superpower is adherence to frictionless reproducibility practices; these practices are responsible for the striking and surprising progress in AI that we see everywhere; they can be learned and adhered to by researchers in whatever research field, automatically increasing the rate of progress in each adherent field.

URL : Data Science at the Singularity

DOI : https://doi.org/10.1162/99608f92.b91339ef

Looking Back to the Future: A Glimpse at Twenty Years of Data Science

Author : Lili Zhang

This paper carries out a lightweight review to explore the potential of data science over the last two decades, focusing on four essential components: data resources, technologies, data infrastructures, and data education.

Considering the barriers to data science, the analysis is mapped onto these four components, highlighting priorities and challenges in social and cultural, epistemological, scientific and technical, economic, legal, and ethical aspects.

As a result, the future development of data science tends to shift toward datafication, data technicity, infrastructuralism, and data literacy empowerment. The data ecosystem, at the macro level, has also been analyzed under the open science umbrella, providing a snapshot for the future development of data science.

URL : Looking Back to the Future: A Glimpse at Twenty Years of Data Science

DOI : https://doi.org/10.5334/dsj-2023-007

Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer

Authors : Moritz Schubotz, Ankit Satpute, André Greiner-Petter, Akiko Aizawa, Bela Gipp

Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access.

The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework. In the worst case, others cannot reproduce the experiment and reuse the findings for subsequent research.

Second, if the ad-hoc research software fails during long-running, computationally expensive experiments, the overall effort to iteratively improve the software and rerun the experiments creates significant time pressure on the researchers. We suggest making caching an integral part of the research software development process, even before the first line of code is written.

This article outlines caching recommendations for developing research software in data science projects. Our recommendations provide a perspective to circumvent common problems such as proprietary dependence, speed, etc. At the same time, caching contributes to the reproducibility of experiments in the open science workflow.

Concerning the four guiding principles, i.e., Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that including the proposed recommendations in research software development will make the data related to that software FAIRer for both machines and humans.

We exhibit the usefulness of some of the proposed recommendations on our recently completed research software project in mathematical information retrieval.
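The caching idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `disk_cached` decorator, the `expensive_step` function, and the temporary cache directory are hypothetical names chosen for illustration, and the sketch assumes the function's arguments are JSON-serializable and its result can be pickled.

```python
import functools
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


def disk_cached(cache_dir):
    """Hypothetical helper: cache a function's results on disk, keyed by its name and arguments."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Build a stable key from the function name and its (JSON-serializable) arguments.
            payload = json.dumps([func.__name__, args, kwargs], sort_keys=True)
            key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
            path = cache_dir / f"{key}.pkl"

            # Reuse a stored result if this exact call has already been computed.
            if path.exists():
                with path.open("rb") as fh:
                    return pickle.load(fh)

            result = func(*args, **kwargs)

            # Write to a temporary file first, then rename, so that a crash
            # in the middle of writing does not leave a corrupt cache entry.
            tmp_path = path.with_suffix(".tmp")
            with tmp_path.open("wb") as fh:
                pickle.dump(result, fh)
            tmp_path.replace(path)
            return result

        return wrapper

    return decorator


# Counter used only to show how often the expensive computation really runs.
CALLS = {"count": 0}
CACHE_DIR = tempfile.mkdtemp(prefix="experiment_cache_")


@disk_cached(CACHE_DIR)
def expensive_step(n):
    """Stand-in for a long-running experiment step."""
    CALLS["count"] += 1
    return sum(i * i for i in range(n))


first = expensive_step(1000)   # computed and written to the cache
second = expensive_step(1000)  # read back from the cache, not recomputed
```

Because results are stored on disk, a rerun after a crash or a code change elsewhere in the pipeline would pick up the stored intermediate result instead of recomputing it. A real project would also need a policy for invalidating entries when the cached function itself changes, which this sketch does not address.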

URL : Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer

DOI : https://doi.org/10.3389/frma.2022.861944

Data Science Tools for Monitoring the Global Repository Eco-System and its Lines of Evolution

Authors : Friedrich Summann, Andreas Czerniak, Jochen Schirrwagen, Dirk Pieper

The global network of scholarly repositories for the publication and dissemination of scientific publications and related materials can already look back on a history of more than twenty years.

During this period, there have been many developments in terms of technical optimization and the increase of content. It is crucial to observe and analyze this evolution in order to draw conclusions for the further development of repositories.

The basis for such an analysis is data. The Open Archives Initiative (OAI) service provider Bielefeld Academic Search Engine (BASE) started indexing repositories in 2004 and has also collected metadata on the repositories themselves.

This paper presents the main features of a planned repository monitoring system. Data have been collected since 2004 and include basic repository metadata as well as the publication metadata of each repository.

This information allows an in-depth analysis of many indicators in different logical combinations. This paper outlines the systems approach and the integration of data science techniques. It describes the intended monitoring system and shows the first results.

URL : Data Science Tools for Monitoring the Global Repository Eco-System and its Lines of Evolution

DOI : https://doi.org/10.3390/publications8020035

Developing the Librarian Workforce for Data Science and Open Science

Authors : Lisa Federer, Sarah Clarke, Maryam Zaringhalam

URL : Developing the Librarian Workforce for Data Science and Open Science

Toward collaborative open data science in metabolomics using Jupyter Notebooks and cloud computing

Authors : Kevin M. Mendez, Leighton Pritchard, Stacey N. Reinke, David I. Broadhurst

Background

A lack of transparency and reporting standards in the scientific community has led to increasing and widespread concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility.

The metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats, data repositories, online spectral libraries, and metabolite databases.

Open data analysis platforms also exist; however, they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully integrated with the published work.

To ensure broad use within the community, such a framework also needs to be inclusive and intuitive for computational novices and experts alike.

Aim of Review

To encourage metabolomics researchers from all backgrounds to take control of their own data science, mould it to their personal requirements, and enthusiastically share resources through open science.

Key Scientific Concepts of Review

This tutorial introduces the concept of interactive web-based computational laboratory notebooks. The reader is guided through a set of experiential tutorials specifically targeted at metabolomics researchers, based around the Jupyter Notebook web application, GitHub data repository, and Binder cloud computing platform.

URL : Toward collaborative open data science in metabolomics using Jupyter Notebooks and cloud computing

DOI : https://doi.org/10.1007/s11306-019-1588-0