Versioning Data Is About More than Revisions: A Conceptual Framework and Proposed Principles

Authors : Jens Klump, Lesley Wyborn, Mingfang Wu, Julia Martin, Robert R. Downs, Ari Asmi

A dataset, small or big, is often changed to correct errors, apply new algorithms, or add new data (e.g., as part of a time series).

In addition, datasets might be bundled into collections, distributed in different encodings or mirrored onto different platforms. All these differences between versions of datasets need to be understood by researchers who want to cite the exact version of the dataset that was used to underpin their research.

Failing to do so reduces the reproducibility of research results. Ambiguous identification of datasets also affects researchers and data centres, who are then unable to gain recognition and credit for their contributions to the collection, creation, curation and publication of individual datasets.

Although the means to identify datasets using persistent identifiers have been in place for more than a decade, systematic data versioning practices are currently not available. In this work, we analysed 39 use cases and current practices of data versioning across 33 organisations.

We noticed that the term ‘version’ was used in a very general sense, extending beyond the more common understanding of ‘version’ as referring primarily to revisions and replacements. Using concepts developed in software versioning and the Functional Requirements for Bibliographic Records (FRBR) as a conceptual framework, we developed six foundational principles for versioning of datasets: Revision, Release, Granularity, Manifestation, Provenance and Citation.

These six principles provide a high-level framework for guiding the consistent practice of data versioning and can also serve as guidance for data centres or data providers when setting up their own data revision and version protocols and procedures.
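
As a rough illustration of how the six principle names might show up in a data centre's own records, here is a minimal sketch. The class, its field names, the mapping of fields to principles, and the example identifiers are all assumptions made for illustration; none of them is taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical version record; field names are illustrative only."""
    identifier: str    # persistent identifier of this exact version (placeholder value)
    title: str
    revision: str      # Revision: label that changes whenever the content changes
    released: bool     # Release: whether this revision has been published
    granularity: str   # Granularity: e.g. "collection", "dataset" or "file"
    media_type: str    # Manifestation: encoding in which the content is distributed
    derived_from: Optional[str] = None  # Provenance: identifier of the predecessor, if any

    def citation(self) -> str:
        # Citation: name the exact revision and encoding that were used.
        return f"{self.title}, version {self.revision} ({self.media_type}). {self.identifier}"


def same_content(a: DatasetVersion, b: DatasetVersion) -> bool:
    """True when two records are different encodings of the same revision."""
    return a.title == b.title and a.revision == b.revision


# Placeholder records: one revision in two encodings, then a later revision.
v1_csv = DatasetVersion("example:ds/1.0/csv", "Example Series", "1.0", True, "dataset", "text/csv")
v1_nc = DatasetVersion("example:ds/1.0/nc", "Example Series", "1.0", True, "dataset", "application/x-netcdf")
v2_csv = DatasetVersion("example:ds/2.0/csv", "Example Series", "2.0", True, "dataset", "text/csv",
                        derived_from="example:ds/1.0/csv")

print(same_content(v1_csv, v1_nc))   # same revision, different encoding
print(same_content(v1_csv, v2_csv))  # different revisions
print(v2_csv.citation())
```

Keeping the encoding separate from the revision label is one way to reflect the point above that 'version' covers more than revisions: the CSV and NetCDF records differ, yet neither replaces the other.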

DOI : http://doi.org/10.5334/dsj-2021-012

Data Discovery Paradigms: User Requirements and Recommendations for Data Repositories

Authors : Mingfang Wu, Fotis Psomopoulos, Siri Jodha Khalsa, Anita de Waard

As data repositories make more data openly available, it becomes challenging for researchers to find what they need either from a repository or through web search engines.

This study investigates data users’ requirements and the role that data repositories can play in supporting data discoverability by meeting those requirements.

We collected 79 data discovery use cases (or data search scenarios), from which we derived nine functional requirements for data repositories through qualitative analysis.

We then applied usability heuristic evaluation and expert review methods to identify best practices that data repositories can implement to meet each functional requirement.

We propose the following ten recommendations for data repository operators to consider for improving data discoverability and users’ data search experience:

1. Provide a range of query interfaces to accommodate various data search behaviours.

2. Provide multiple access points to find data.

3. Make it easier for researchers to judge relevance, accessibility and reusability of a data collection from a search summary.

4. Make individual metadata records readable and analysable.

5. Enable sharing and downloading of bibliographic references.

6. Expose data usage statistics.

7. Strive for consistency with other repositories.

8. Identify and aggregate metadata records that describe the same data object.

9. Make metadata records easily indexed and searchable by major web search engines.

10. Follow API search standards and community adopted vocabularies for interoperability.
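
One common way a repository can act on recommendation 9 is to embed schema.org `Dataset` markup as JSON-LD in each landing page. The abstract does not prescribe a particular method, so the sketch below is only an assumption about how this could look; the function names and all example values are hypothetical.

```python
import json


def dataset_jsonld(name: str, description: str, identifier: str, version: str, url: str) -> str:
    """Serialise a minimal schema.org Dataset description as JSON-LD."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": identifier,
        "version": version,
        "url": url,
    }
    return json.dumps(record, indent=2)


def embed_in_html(jsonld: str) -> str:
    """Wrap JSON-LD in the script element that web crawlers look for."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


# Placeholder values, not a real dataset.
markup = dataset_jsonld(
    name="Example Series",
    description="Placeholder description of an example dataset.",
    identifier="example:ds/1.0",
    version="1.0",
    url="https://example.org/datasets/example-series",
)
print(embed_in_html(markup))
```

The same dictionary could also be served from a search API, which touches on recommendation 10, although which vocabulary suits a given community is a separate decision.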

DOI : http://doi.org/10.5334/dsj-2019-003