Data Sharing: Convert Challenges into Opportunities

Author : Ana Sofia Figueiredo

Initiatives for sharing research data are opportunities to increase the pace of knowledge discovery and scientific progress. The reuse of research data has the potential to avoid the duplication of data sets and to bring new insights from multiple analyses of the same data set.

For example, the study of genomic variations associated with cancer benefits from the universal collection of such data and helps in selecting the most appropriate therapy for a specific patient. However, data sharing poses challenges to the scientific community.

These challenges are of an ethical, cultural, legal, financial, or technical nature. This article reviews the impact that data sharing has on science and society and presents guidelines to improve the efficient sharing of research data.

URL : Data Sharing: Convert Challenges into Opportunities

DOI : https://doi.org/10.3389/fpubh.2017.00327

Balancing the local and the universal in maintaining ethical access to a genomics biobank

Authors : Catherine Heeney, Shona M. Kerr

Background

Issues of balancing data accessibility with ethical considerations and governance of a genomics research biobank, Generation Scotland, are explored within the evolving policy landscape of the past ten years. During this time data sharing and open data access have become increasingly important topics in biomedical research.

Decisions around data access are shaped both by the local, through governance arrangements and practices such as linkage to health records, and by the global, through policies for biobanking and the sharing of data with large-scale biomedical research data resources and consortia.

Methods

We use a literature review of policy-relevant documents that apply to the conduct of biobanks in two areas: support for open access, and the protection of data subjects and of researchers managing a bioresource.

We present examples of decision making within a biobank based upon observations of the Generation Scotland Access Committee. We reflect upon how the drive towards open access raises ethical dilemmas for established biorepositories containing data and samples from human subjects.

Results

Despite much discussion in science policy literature about standardisation, the contextual aspects of biobanking are often overlooked. Using our engagement with Generation Scotland, we demonstrate the importance of local arrangements in the creation of a responsive ethical approach to biorepository governance.

We argue that governance decisions regarding access to the biobank are intertwined with considerations about maintenance and viability at the local level. We show that in addition to the focus upon ever more universal and standardised practices, the local expertise gained in the management of such repositories must be supported.

Conclusions

A commitment to open access in genomics research has found almost universal backing in science and health policy circles, but repositories of data and samples from human subjects may have to operate under managed access, to protect privacy, align with participant consent and ensure that the resource can be managed in a sustainable way.

Data access committees need to be reflexive and flexible, to cope with changing technology and opportunities and threats from the wider data sharing environment. To understand these interactions also involves nurturing what is particular about the biobank in its local context.

URL : Balancing the local and the universal in maintaining ethical access to a genomics biobank

DOI : https://doi.org/10.1186/s12910-017-0240-7

Biotea: semantics for Pubmed Central

Authors : Alexander Garcia, Federico Lopez, Leyla Garcia, Olga Giraldo, Victor Bucheli, Michel Dumontier

A significant portion of biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. One approach to facilitate reuse of the scientific literature is to structure this information as linked data using standardized web technologies.

In this paper we present the second version of Biotea, a semantic, linked data version of the open-access subset of PubMed Central that has been enhanced with specialized annotation pipelines that use existing infrastructure from the National Center for Biomedical Ontology.

We expose our models, services, software and datasets. Our infrastructure enables manual and semi-automatic annotation; the resulting data are represented as RDF-based linked data and can be readily queried using the SPARQL query language.

We illustrate the utility of our system with several use cases. Our datasets, methods and techniques are available at http://biotea.github.io.
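The triple-pattern querying described above can be illustrated with a toy sketch. This is not Biotea's actual schema or vocabulary; all identifiers and predicate names below are made up for illustration, and the pattern matcher stands in for what SPARQL does over real RDF stores.

```python
# Toy illustration of the linked-data idea behind Biotea (not its actual
# schema): articles as subject-predicate-object triples, queried by pattern.

# Hypothetical triples; Biotea's real vocabulary and identifiers differ.
triples = [
    ("pmc:123", "dc:title", "A study of gene X"),
    ("pmc:123", "biotea:annotation", "ncbo:GeneX"),
    ("pmc:456", "dc:title", "Protein folding survey"),
    ("pmc:456", "biotea:annotation", "ncbo:Folding"),
]

def match(pattern, store):
    """Return all triples matching a single pattern; None is a wildcard."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which articles carry the GeneX annotation?" - a SPARQL-like pattern match.
hits = match((None, "biotea:annotation", "ncbo:GeneX"), triples)
articles = [s for s, _, _ in hits]
print(articles)  # ['pmc:123']
```

In a real linked-data setting the same question would be a SPARQL basic graph pattern evaluated against the published RDF, which is the point of exposing the literature this way: the query is computational rather than textual.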

URL : Biotea: semantics for Pubmed Central

DOI : https://doi.org/10.7717/peerj.4201

Completeness and overlap in open access systems: Search engines, aggregate institutional repositories and physics-related open sources

Authors : Ming-yueh Tsay, Tai-luan Wu, Ling-li Tseng

This study examines the completeness and overlap of coverage in physics of six open access scholarly communication systems, including two search engines (Google Scholar and Microsoft Academic), two aggregate institutional repositories (OAIster and OpenDOAR), and two physics-related open sources (arXiv.org and Astrophysics Data System).

The 2001–2013 Nobel Laureates in Physics served as the sample. Bibliographic records of their publications were retrieved and downloaded from each system, and a computer program was developed to perform the analytical tasks of sorting, comparison, elimination, aggregation and statistical calculations.

Quantitative analyses and cross-referencing were performed to determine the completeness and overlap of the system coverage of the six open access systems.
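The sorting, comparison, and aggregation steps described in the methods can be sketched with set operations. The records and the normalisation rule below are made up; the study's actual program and matching criteria are not described in this abstract.

```python
# Sketch (with made-up records) of a completeness/overlap analysis:
# normalise bibliographic records to a comparison key, then measure
# coverage across systems with set operations.

def normalise(title):
    """Crude de-duplication key: lowercase, alphanumerics only."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

# Hypothetical retrieval results from two systems.
google_scholar = {"Observation of Gravitational Waves", "Quantum Hall effect"}
arxiv = {"Observation of gravitational waves", "Topological insulators"}

gs = {normalise(t) for t in google_scholar}
ax = {normalise(t) for t in arxiv}

union = gs | ax        # all distinct records found in any system
overlap = gs & ax      # records found in both systems
completeness_gs = len(gs) / len(union)  # share of the union each system covers
completeness_ax = len(ax) / len(union)

print(len(union), len(overlap))   # 3 1
print(round(completeness_gs, 2))  # 0.67
```

With six systems the same idea generalises: completeness is each system's share of the union of all records, and overlap is measured pairwise or across all systems.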

The results may enable scholars to select an appropriate open access system as an efficient scholarly communication channel, and may help academic institutions build institutional repositories or independently create citation index systems in the future. Suggestions on indicators and tools for academic assessment are presented based on the comprehensiveness assessment of each system.

URL : Completeness and overlap in open access systems: Search engines, aggregate institutional repositories and physics-related open sources

DOI : https://doi.org/10.1371/journal.pone.0189751

Blockchains, Orphan Works, and the Public Domain

Authors : Jake Goldenfein, Dan Hunter

This Article outlines a blockchain based system to solve the orphan works problem. Orphan works are works still ostensibly protected by copyright for which an author cannot be found.

Orphan works represent a significant problem for the efficient dissemination of knowledge, since users cannot license the works, and as a result may choose not to use them. Our proposal uses a blockchain to register attempts to find the authors of orphan works, and otherwise to facilitate use of those works.

There are three elements to our proposal. First, we propose a number of mechanisms, including automated systems, to perform a diligent search for a rights holder. Second, we propose a blockchain register where every search for a work’s owner can be recorded. Third, we propose a legal mechanism that delivers works into orphanhood, and affords a right to use those works after a search for a rights holder is deemed diligent.

These changes would provide any user of an orphan work with an assurance that they were acting legally as long as they had consulted the register and/or performed a diligent search for the work’s owner.
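The register element of the proposal can be sketched as an append-only, hash-chained log of diligent-search records, which is the core property a blockchain provides here: recorded searches cannot later be altered or silently removed. This is a minimal illustration under assumed field names, not the authors' design; a real deployment would sit on an actual blockchain with distributed consensus.

```python
# Minimal sketch of an append-only, hash-chained register of
# diligent-search records; all field names are illustrative.
import hashlib
import json

class SearchRegister:
    def __init__(self):
        self.blocks = []

    def record_search(self, work_id, searcher, steps):
        prev = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        entry = {
            "work_id": work_id,    # identifier of the candidate orphan work
            "searcher": searcher,  # who performed the diligent search
            "steps": steps,        # what was consulted (registries, databases)
            "prev": prev,          # hash link to the previous block
        }
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        self.blocks.append(entry)
        return entry["hash"]

    def verify(self):
        """Check that no recorded search has been altered or removed."""
        prev = "0" * 64
        for b in self.blocks:
            body = {k: v for k, v in b.items() if k != "hash"}
            if b["prev"] != prev or b["hash"] != hashlib.sha256(
                    json.dumps(body, sort_keys=True).encode()).hexdigest():
                return False
            prev = b["hash"]
        return True

reg = SearchRegister()
reg.record_search("work-42", "library-a", ["copyright office", "rights registry"])
print(reg.verify())  # True
```

A user consulting such a register could verify both that a diligent search was recorded for a work and that the record has not been tampered with since, which is the assurance the proposal aims to deliver.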

The Article demonstrates a range of complementary legal and technological architectures that, in various formations, can be deployed to address the orphan works problem. We show that these technological systems are useful for enhancement of the public domain more generally, through the existence of a growing registry of gray status works and clarified conditions for their use.

The selection and design of any particular implementation is a choice for policy makers and technologists. Rather than specify how that choice should look, the goal here is to demonstrate the utility of the technology and to clarify and promote its role in reforming this vexed area of law.

URL : Blockchains, Orphan Works, and the Public Domain

Alternative location : https://lawandarts.org/article/blockchains-orphan-works-and-the-public-domain/

Assessing Research Data Deposits and Usage Statistics within IDEALS

Author : Christie A. Wiley

Objectives

This study follows up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by addressing the following research questions: (1) What is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign (UIUC) campus repository? Are datasets more likely to be single-file or multiple-file items? (2) What is the usage data associated with these datasets? Which items are most popular?

Methods

The dataset records collected in this study were identified by filtering item types categorized as “data” or “dataset” using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet to include data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item’s statistics report.

The Handle identifier represents the dataset record’s persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository.

Download statistics were collected via a website link for each dataset record and indicate the number of times the dataset record has been downloaded. Once collected, the data were used to evaluate datasets deposited into IDEALS.
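The evaluation over the collected fields (Handle, composition code, download count) amounts to simple tallying, which can be sketched as follows. The records and field names below are made up for illustration and are not IDEALS's actual export format.

```python
# Sketch of the kind of evaluation the methods describe, over hypothetical
# records; field names are illustrative, not IDEALS's actual schema.
from collections import Counter

records = [
    {"handle": "2142/0001", "composition": "single", "downloads": 120},
    {"handle": "2142/0002", "composition": "multiple", "downloads": 45},
    {"handle": "2142/0003", "composition": "single", "downloads": 300},
]

# Single-file vs multiple-file composition (research question 1).
composition_counts = Counter(r["composition"] for r in records)

# Usage statistics (research question 2).
total_downloads = sum(r["downloads"] for r in records)
most_popular = max(records, key=lambda r: r["downloads"])

print(composition_counts)   # Counter({'single': 2, 'multiple': 1})
print(total_downloads)      # 465
print(most_popular["handle"])  # 2142/0003
```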

Results

A total of 522 datasets were identified for analysis, covering the period between January 2007 and August 2016. This study revealed two influxes of deposits, one during 2008-2009 and one in 2014. During the first timeframe, a large number of PDFs were deposited by the Illinois Department of Agriculture, whereas the 2014 influx consisted of Microsoft Excel files deposited by the Rare Books and Manuscript Library.

Single-file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663, and downloads per month per file averaged 3.2 across all datasets.

Conclusion

Academic librarians, repository managers, and research data services staff can use the results presented here to anticipate the nature of research data that may be deposited within institutional repositories.

With increased awareness, content recruitment, and improvements, institutional repositories can provide a viable cyberinfrastructure for researchers to deposit data, but much can be learned from the data already deposited.

Awareness of trends can help librarians facilitate discussions with researchers about research data deposits as well as better tailor their services to address short-term and long-term research needs.

URL : Assessing Research Data Deposits and Usage Statistics within IDEALS

DOI : https://doi.org/10.7191/jeslib.2017.1112

Documentation and Visualisation of Workflows for Effective Communication, Collaboration and Publication @ Source

Authors : Cerys Willoughby, Jeremy G. Frey

Workflows that process data from research activities and drive in silico experiments are becoming an increasingly important means of conducting scientific research. Workflows have the advantage that not only can they be automated and used to process data repeatedly, but they can also be reused, in part or in whole, enabling them to be evolved for use in new experiments.

A number of studies have investigated strategies for storing and sharing workflows for the benefit of reuse. These have revealed that simply storing workflows in repositories without additional context does not enable workflows to be successfully reused.

These studies have investigated what additional resources are needed to support users of workflows, in particular adding provenance traces and making workflows and their resources machine-readable.

These additions also include adding metadata for curation, annotations for comprehension, and including data sets to provide additional context to the workflow. Ultimately though, these mechanisms still rely on researchers having access to the software to view and run the workflows.

We argue that there are situations where researchers may want to understand a workflow in ways that go beyond what provenance traces provide, and without having to run the workflow directly; in many situations it can be difficult or impossible to run the original workflow.

To that end, we have investigated the creation of an interactive workflow visualisation that captures the flow-chart element of the workflow together with additional context, including annotations, descriptions, parameters, metadata, and input, intermediate, and results data. This context can be added to the record of a workflow experiment to enhance curation and add value for reuse.

We have created interactive workflow visualisations for the popular workflow creation tool KNIME, which does not provide an in-built function to extract provenance information; such information can otherwise be viewed only through the tool itself.

Making use of the strengths of KNIME for adding documentation and user-defined metadata, we can extract and create a visualisation and curation package that encourages and enhances curation@source, facilitating effective communication, collaboration, and reuse of workflows.
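The curation-package idea can be sketched as a self-contained record of the workflow's structure and context, serialised so it can be inspected without the original tool. The node types, field names, and serialisation below are assumptions for illustration, not the authors' actual package format or KNIME's internal representation.

```python
# Sketch of a curation package: workflow nodes captured with their
# annotations, parameters, and data references; field names are illustrative.
import json

workflow = {
    "name": "example-analysis",
    "description": "Reads a CSV, filters rows, summarises the result.",
    "nodes": [
        {"id": 1, "type": "CSV Reader", "annotation": "raw instrument data",
         "parameters": {"path": "input.csv"}, "outputs": ["table-1"]},
        {"id": 2, "type": "Row Filter", "annotation": "drop calibration rows",
         "parameters": {"column": "kind", "exclude": "calibration"},
         "inputs": ["table-1"], "outputs": ["table-2"]},
    ],
    "edges": [[1, 2]],  # flow-chart structure for the visualisation
}

# Serialise into a standalone package a viewer can render without KNIME.
package = json.dumps(workflow, indent=2)
restored = json.loads(package)
print(len(restored["nodes"]))  # 2
```

The design point is that the package carries the flow-chart structure, the human annotations, and references to the data at each step, so a reader can follow the experiment even when the original workflow cannot be run.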

URL : Documentation and Visualisation of Workflows for Effective Communication, Collaboration and Publication @ Source

DOI : https://doi.org/10.2218/ijdc.v12i1.532