Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles

Authors : Jens Klump, Lesley Wyborn, Mingfang Wu, Julia Martin, Robert R. Downs, Ari Asmi

A dataset, small or big, is often changed to correct errors, apply new algorithms, or add new data (e.g., as part of a time series).

In addition, datasets might be bundled into collections, distributed in different encodings or mirrored onto different platforms. All these differences between versions of datasets need to be understood by researchers who want to cite the exact version of the dataset that was used to underpin their research.

Failing to do so reduces the reproducibility of research results. Ambiguous identification of datasets also impacts researchers and data centres who are unable to gain recognition and credit for their contributions to the collection, creation, curation and publication of individual datasets.

Although the means to identify datasets using persistent identifiers have been in place for more than a decade, systematic data versioning practices are currently not available. In this work, we analysed 39 use cases and current practices of data versioning across 33 organisations.

We noticed that the term ‘version’ was used in a very general sense, extending beyond the more common understanding of ‘version’ to refer primarily to revisions and replacements. Using concepts developed in software versioning and the Functional Requirements for Bibliographic Records (FRBR) as a conceptual framework, we developed six foundational principles for versioning of datasets: Revision, Release, Granularity, Manifestation, Provenance and Citation.

These six principles provide a high-level framework for guiding the consistent practice of data versioning and can also serve as guidance for data centres or data providers when setting up their own data revision and version protocols and procedures.

URL : Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles

DOI : http://doi.org/10.5334/dsj-2021-012

A Review of Open Research Data Policies and Practices in China

Authors : Lili Zhang, Robert R. Downs, Jianhui Li, Liangming Wen, Chengzan Li

This paper initially conducts a literature review and content analysis of the open research data policies in China. Next, a series of exemplars describe data practices to promote and enable the use of open research data, including open data practices in research programs, data repositories, data journals, and citizen science.

Moreover, the top four driving forces are identified and analyzed along with their responsible guiding work. In addition, the “landscape of open research data ecology in China” is derived from the literature review and from observations of actual cases, where the interaction and mutual development of data policies, data programs, and data practices are recognized.

Finally, future trends of research data practices within China and internationally are discussed. We hope the analysis provides perspective on current open data practices in China along with insight into the need for additional research on scientific data sharing and management.

URL : A Review of Open Research Data Policies and Practices in China

DOI : http://doi.org/10.5334/dsj-2021-003

Improving Opportunities for New Value of Open Data: Assessing and Certifying Research Data Repositories

Author : Robert R. Downs

Investments in research that produce scientific and scholarly data can be leveraged by enabling the resulting research data products and services to be used by broader communities and for new purposes, extending reuse beyond the initial users and purposes for which the data were originally collected.

Submitting research data to a data repository offers opportunities for the data to be used in the future, providing ways for new benefits to be realized from data reuse. Improvements to data repositories that facilitate new uses of data increase the potential for data reuse and for gains in the value of open data products and services that are associated with such reuse.

Assessing and certifying the capabilities and services offered by data repositories provides opportunities for improving the repositories and for realizing the value to be attained from new uses of data.

The evolution of data repository certification instruments is described and discussed in terms of the implications for the curation and continuing use of research data.

URL : Improving Opportunities for New Value of Open Data: Assessing and Certifying Research Data Repositories

DOI : http://doi.org/10.5334/dsj-2021-001

Risk Assessment for Scientific Data

Authors : Matthew S. Mayernik, Kelsey Breseman, Robert R. Downs, Ruth Duerr, Alexis Garretson, Chung-Yi (Sophie) Hou

Ongoing stewardship is required to keep data collections and archives in existence. Scientific data collections may face a range of risk factors that could hinder, constrain, or limit current or future data use.

Identifying such risk factors to data use is a key step in preventing or minimizing data loss. This paper presents an analysis of data risk factors that scientific data collections may face, and a data risk assessment matrix to support data risk assessments to help ameliorate those risks.

The goals of this work are to inform and enable effective data risk assessment by: a) individuals and organizations who manage data collections, and b) individuals and organizations who want to help to reduce the risks associated with data preservation and stewardship.

The data risk assessment framework presented in this paper provides a platform from which risk assessments can begin, and a reference point for discussions of data stewardship resource allocations and priorities.

URL : Risk Assessment for Scientific Data

DOI : http://doi.org/10.5334/dsj-2020-010

A Discussion of Value Metrics for Data Repositories in Earth and Environmental Sciences

Authors : Cynthia Parr, Corinna Gries, Margaret O’Brien, Robert R. Downs, Ruth Duerr, Rebecca Koskela, Philip Tarrant, Keith E. Maull, Nancy Hoebelheinrich, Shelley Stall

Despite growing recognition of the importance of public data to the modern economy and to scientific progress, long-term investment in the repositories that manage and disseminate scientific data in easily accessible ways remains elusive. Repositories are asked to demonstrate that there is a net value of their data and services to justify continued funding or attract new funding sources.

Here, representatives from a number of environmental and Earth science repositories evaluate approaches for assessing the costs and benefits of publishing scientific data in their repositories, identifying various metrics that repositories typically use to report on the impact and value of their data products and services, plus additional metrics that would be useful but are not typically measured.

We rated each metric by (a) the difficulty of implementation by our specific repositories and (b) its importance for value determination. As managers of environmental data repositories, we find that some of the most easily obtainable data-use metrics (such as data downloads and page views) may be less indicative of value than metrics that relate to discoverability and broader use.

Other intangible but equally important metrics (e.g., laws or regulations impacted, lives saved, new proposals generated) will require considerable additional research to describe and develop, plus resources to implement at scale.

As value can only be determined from the point of view of a stakeholder, it is likely that multiple sets of metrics will be needed, tailored to specific stakeholder needs. Moreover, economically based analyses or the use of specialists in the field are expensive and can happen only as resources permit.

URL : A Discussion of Value Metrics for Data Repositories in Earth and Environmental Sciences

DOI : http://doi.org/10.5334/dsj-2019-058