Data Science and AI in Context: Summary and Insights

Author : Alfred Spector

This paper explores how to deploy data science and data-driven AI, focusing on the broad collection of considerations beyond those of statistics and machine learning. Building on an analysis rubric introduced in a recent textbook by the author and three others, this paper summarizes some of the book’s key points and adds reflections on AI’s extraordinary growth and societal effects. The paper also discusses how to balance inevitable trade-offs and provides further thoughts on societal implications.

DOI : https://doi.org/10.1162/99608f92.cdebd845

In the Academy, Data Science Is Lonely: Barriers to Adopting Data Science Methods for Scientific Research

Authors : Gabrielle O’Brien, Jordan Mick

Data science has been heralded as a transformative family of methods for scientific discovery. Despite this excitement, putting these methods into practice in scientific research has proven challenging. We conducted a qualitative interview study of 25 researchers at the University of Michigan, all scientists who currently work outside of data science (in fields such as astronomy, education, chemistry, and political science) and wish to adopt data science methods as part of their research program.

Semi-structured interviews explored the barriers they faced and strategies scientists used to persevere. These scientists quickly identified that they lacked the expertise to confidently implement and interpret new methods.

For most, independent study was unsuccessful, owing to limited time, missing foundational skills, and difficulty navigating the marketplace of educational data science resources. Overwhelmingly, participants reported isolation in their endeavors and a desire for a greater community. Many sought to bootstrap a community on their own, with mixed results.

Based on their narratives, we provide preliminary recommendations for academic departments, training programs, campus-wide data science initiatives, and universities to build supportive communities of practice that cultivate expertise. These community relationships may be key to growing the research capacity of scientific institutions. 

DOI : https://doi.org/10.1162/99608f92.7ca04767

Data Science at the Singularity

Author : David Donoho

Something fundamental to computation-based research has really changed in the last ten years. In certain fields, progress is simply dramatically more rapid than previously. Researchers in affected fields are living through a period of profound transformation, as the fields undergo a transition to frictionless reproducibility (FR).

This transition markedly changes the rate of spread of ideas and practices, affects scientific mindsets and the goals of science, and erases memories of much that came before. The emergence of FR flows from three data science principles that matured together after decades of work by many technologists and numerous research communities.

The mature principles involve data sharing, code sharing, and competitive challenges, but implemented in the particularly strong form of frictionless open services. Empirical Machine Learning is today’s leading adherent field, and its hidden superpower is adherence to frictionless reproducibility practices. These practices are responsible for the striking and surprising progress in AI that we see everywhere. They can be learned and adhered to by researchers in whatever research field, automatically increasing the rate of progress in each adherent field.

DOI : https://doi.org/10.1162/99608f92.b91339ef

Looking Back to the Future: A Glimpse at Twenty Years of Data Science

Author : Lili Zhang

This paper carries out a lightweight review to explore the potentials of data science in the last two decades and especially focuses on the four essential components: data resources, technologies, data infrastructures, and data education.

Considering the barriers of data science, the analysis has been mapped into four essential components, highlighting priorities and challenges in social and cultural, epistemological, scientific and technical, economic, legal, and ethical aspects.

As a result, the future development of data science tends to shift toward datafication, data technicity, infrastructuralism, and data literacy empowerment. The data ecosystem, at the macro level, has also been analyzed under the open science umbrella, providing a snapshot for the future development of data science.

DOI : https://doi.org/10.5334/dsj-2023-007

Caching and Reproducibility: Making Data Science Experiments Faster and FAIRer

Authors : Moritz Schubotz, Ankit Satpute, André Greiner-Petter, Akiko Aizawa, Bela Gipp

Small to medium-scale data science experiments often rely on research software developed ad-hoc by individual scientists or small teams. Often there is no time to make the research software fast, reusable, and open access.

The consequence is twofold. First, subsequent researchers must spend significant work hours building upon the proposed hypotheses or experimental framework. In the worst case, others cannot reproduce the experiment and reuse the findings for subsequent research. Second, if the ad-hoc research software fails during often long-running, computationally expensive experiments, the overall effort to iteratively improve the software and rerun the experiments creates significant time pressure on the researchers.

We suggest making caching an integral part of the research software development process, even before the first line of code is written.

This article outlines caching recommendations for developing research software in data science projects. Our recommendations provide a perspective to circumvent common problems such as proprietary dependence and slow execution. At the same time, caching contributes to the reproducibility of experiments in the open science workflow.

Concerning the four guiding principles, i.e., Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that including the proposed recommendations in research software development will make the data related to that software FAIRer for both machines and humans.

We exhibit the usefulness of some of the proposed recommendations on our recently completed research software project in mathematical information retrieval.
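
As a rough illustration of the general idea of caching intermediate results in research code, the following minimal Python sketch stores the output of an expensive step on disk and reuses it on later runs. It does not reproduce the authors' specific recommendations, which are not listed in the abstract; the names `disk_cache` and `expensive_step` and the choice of pickle files in a temporary directory are assumptions made purely for this example.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path


def disk_cache(cache_dir):
    """Hypothetical helper: cache a function's results on disk."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Build a key from the function name and its pickled arguments.
            payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            key = hashlib.sha256(payload).hexdigest()
            path = cache_dir / f"{key}.pkl"
            if path.exists():
                # Cache hit: load the stored result instead of recomputing.
                with path.open("rb") as fh:
                    return pickle.load(fh)
            # Cache miss: compute, store, and return the result.
            result = func(*args, **kwargs)
            with path.open("wb") as fh:
                pickle.dump(result, fh)
            return result

        return wrapper

    return decorator


calls = {"count": 0}  # counts how often the real computation runs


@disk_cache(tempfile.mkdtemp())
def expensive_step(n):
    """Stand-in for a long-running experiment step (illustrative only)."""
    calls["count"] += 1
    return sum(i * i for i in range(n))


first = expensive_step(1000)   # computed and written to the cache
second = expensive_step(1000)  # read back from the cache
```

In a real project the cache directory would be a persistent location rather than a temporary one, so that a crashed or interrupted run can resume from the stored intermediate results. Pickle files should only be loaded from trusted locations.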

DOI : https://doi.org/10.3389/frma.2022.861944

Data Science Tools for Monitoring the Global Repository Eco-System and its Lines of Evolution

Authors : Friedrich Summann, Andreas Czerniak, Jochen Schirrwagen, Dirk Pieper

The global network of scholarly repositories for the publication and dissemination of scientific publications and related materials can already look back on a history of more than twenty years.

During this period, there have been many developments in terms of technical optimization and the increase of content. It is crucial to observe and analyze this evolution in order to draw conclusions for the further development of repositories.

The basis for such an analysis is data. The Open Archives Initiative (OAI) service provider Bielefeld Academic Search Engine (BASE) started indexing repositories in 2004 and has also collected metadata on the repositories themselves.

This paper presents the main features of a planned repository monitoring system. Data have been collected since 2004 and include basic repository metadata as well as publication metadata of a repository.

This information allows an in-depth analysis of many indicators in different logical combinations. This paper outlines the systems approach and the integration of data science techniques. It describes the intended monitoring system and shows the first results.

DOI : https://doi.org/10.3390/publications8020035

Developing the Librarian Workforce for Data Science and Open Science

Authors : Lisa Federer, Sarah Clarke, Maryam Zaringhalam
