Biotea: semantics for Pubmed Central

Authors : Alexander Garcia, Federico Lopez, Leyla Garcia, Olga Giraldo, Victor Bucheli, Michel Dumontier

A significant portion of biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. One approach to facilitate reuse of the scientific literature is to structure this information as linked data using standardized web technologies.

In this paper we present the second version of Biotea, a semantic, linked data version of the open-access subset of PubMed Central that has been enhanced with specialized annotation pipelines that use existing infrastructure from the National Center for Biomedical Ontology.

We expose our models, services, software and datasets. Our infrastructure enables manual and semi-automatic annotation; the resulting data are represented as RDF-based linked data and can be readily queried using the SPARQL query language.
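
As a hedged illustration of what querying such RDF data with SPARQL can look like, the minimal Python sketch below issues a query through the SPARQLWrapper library. The endpoint URL and the exact classes and predicates (bibo:AcademicArticle, dcterms:title) are placeholders chosen for illustration; the actual Biotea model and services are documented at http://biotea.github.io.

```python
# Minimal sketch of querying a Biotea-style SPARQL endpoint.
# The endpoint URL and the vocabulary used below are illustrative placeholders,
# not the confirmed Biotea schema.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/biotea/sparql"  # placeholder endpoint

QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo:    <http://purl.org/ontology/bibo/>

SELECT ?article ?title WHERE {
  ?article a bibo:AcademicArticle ;
           dcterms:title ?title .
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["article"]["value"], "-", binding["title"]["value"])
```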

We illustrate the utility of our system with several use cases. Our datasets, methods and techniques are available at http://biotea.github.io.

URL : Biotea: semantics for Pubmed Central

DOI : https://doi.org/10.7717/peerj.4201

Completeness and overlap in open access systems: Search engines, aggregate institutional repositories and physics-related open sources

Authors : Ming-yueh Tsay, Tai-luan Wu, Ling-li Tseng

This study examines the completeness and overlap of coverage in physics of six open access scholarly communication systems, including two search engines (Google Scholar and Microsoft Academic), two aggregate institutional repositories (OAIster and OpenDOAR), and two physics-related open sources (arXiv.org and Astrophysics Data System).

The 2001–2013 Nobel Laureates in Physics served as the sample. Bibliographic records of their publications were retrieved and downloaded from each system, and a computer program was developed to perform the analytical tasks of sorting, comparison, elimination, aggregation and statistical calculations.

Quantitative analyses and cross-referencing were performed to determine the completeness and overlap of coverage across the six open access systems.
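
The sketch below is not the authors' program, but a simplified Python illustration of the kind of de-duplication, completeness and overlap calculation described above. It assumes each system's bibliographic records have already been normalised to comparable keys (for example, first author surname, year, normalised title); the record values shown are invented.

```python
# Completeness and pairwise overlap over normalised bibliographic keys.
from itertools import combinations

records = {
    "Google Scholar": {("smith", 2005, "an example paper"), ("lee", 2010, "another paper")},
    "arXiv.org":      {("smith", 2005, "an example paper")},
    # ... one set of normalised keys per open access system
}

all_items = set().union(*records.values())

# Completeness: share of the union of all retrieved items that each system covers.
for system, items in records.items():
    print(f"{system}: completeness = {len(items) / len(all_items):.2%}")

# Pairwise overlap between systems, expressed as a Jaccard index.
for (a, items_a), (b, items_b) in combinations(records.items(), 2):
    jaccard = len(items_a & items_b) / len(items_a | items_b)
    print(f"overlap({a}, {b}) = {jaccard:.2%}")
```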

The results may enable scholars to select an appropriate open access system as an efficient scholarly communication channel, and academic institutions may build institutional repositories or independently create citation index systems in the future. Suggestions on indicators and tools for academic assessment are presented based on the comprehensiveness assessment of each system.

URL : Completeness and overlap in open access systems: Search engines, aggregate institutional repositories and physics-related open sources

DOI : https://doi.org/10.1371/journal.pone.0189751

Blockchains, Orphan Works, and the Public Domain

Authors : Jake Goldenfein, Dan Hunter

This Article outlines a blockchain-based system to solve the orphan works problem. Orphan works are works still ostensibly protected by copyright for which an author cannot be found.

Orphan works represent a significant problem for the efficient dissemination of knowledge, since users cannot license the works, and as a result may choose not to use them. Our proposal uses a blockchain to register attempts to find the authors of orphan works, and otherwise to facilitate use of those works.

There are three elements to our proposal. First, we propose a number of mechanisms, including automated systems, to perform a diligent search for a rights holder. Second, we propose a blockchain register where every search for a work’s owner can be recorded. Third, we propose a legal mechanism that delivers works into orphanhood and affords a right to use those works after a search for a rights holder is deemed diligent.

These changes would provide any user of an orphan work with an assurance that they were acting legally as long as they had consulted the register and/or performed a diligent search for the work’s owner.
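
As an illustrative sketch only, and not the Article's specification, the Python snippet below shows the kind of record a diligent-search register might hold, with entries chained by hashes so that earlier entries cannot be silently altered. All field names and values are invented for illustration.

```python
# Hypothetical diligent-search register entry, hash-chained to the previous entry.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class SearchRecord:
    work_id: str           # identifier of the putative orphan work
    searcher: str          # who performed the diligent search
    sources_checked: list  # registries, databases, collecting societies consulted
    outcome: str           # e.g. "rights holder not found"
    timestamp: str
    prev_hash: str         # hash of the previous register entry

    def entry_hash(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()


genesis = "0" * 64
entry = SearchRecord(
    work_id="urn:example:photograph-1923-004",
    searcher="Example Library",
    sources_checked=["national copyright register", "collecting society X"],
    outcome="rights holder not found",
    timestamp=datetime.now(timezone.utc).isoformat(),
    prev_hash=genesis,
)
print(entry.entry_hash())
```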

The Article demonstrates a range of complementary legal and technological architectures that, in various formations, can be deployed to address the orphan works problem. We show that these technological systems are useful for enhancement of the public domain more generally, through the existence of a growing registry of gray status works and clarified conditions for their use.

The selection and design of any particular implementation is a choice for policy makers and technologists. Rather than specify how that choice should look, the goal here is to demonstrate the utility of the technology and to clarify and promote its role in reforming this vexed area of law.

URL : Blockchains, Orphan Works, and the Public Domain

Alternative location : https://lawandarts.org/article/blockchains-orphan-works-and-the-public-domain/

Assessing Research Data Deposits and Usage Statistics within IDEALS

Author : Christie A. Wiley

Objectives

This study follows up on previous work that began examining data deposited in an institutional repository. The work here extends the earlier study by addressing the following research questions: (1) What is the file composition of datasets ingested into the University of Illinois at Urbana-Champaign (UIUC) campus repository? Are datasets more likely to be single-file or multiple-file items? (2) What is the usage data associated with these datasets? Which items are most popular?

Methods

The dataset records collected in this study were identified by filtering item types categorized as “data” or “dataset” using the advanced search function in IDEALS. Returned search results were collected in an Excel spreadsheet to include data such as the Handle identifier, date ingested, file formats, composition code, and the download count from the item’s statistics report.

The Handle identifier represents the dataset record’s persistent identifier. Composition represents codes that categorize items as single or multiple file deposits. Date available represents the date the dataset record was published in the campus repository.

Download statistics were collected via a website link for each dataset record and indicate the number of times the dataset record has been downloaded. Once the data were collected, they were used to evaluate the datasets deposited into IDEALS.
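
As a rough sketch of the kind of summary such a spreadsheet supports (the file name and column headings below are assumptions, not the study's actual export), a few lines of Python can tally single- versus multiple-file deposits and average the download counts.

```python
# Summarise a hypothetical CSV export of the collected spreadsheet.
import csv
from statistics import mean

single, multiple, downloads = 0, 0, []

with open("ideals_datasets.csv", newline="") as f:  # hypothetical export
    for row in csv.DictReader(f):
        if row["composition"] == "single":
            single += 1
        else:
            multiple += 1
        downloads.append(int(row["download_count"]))

print(f"single-file deposits:   {single}")
print(f"multiple-file deposits: {multiple}")
print(f"total downloads:        {sum(downloads)}")
print(f"mean downloads/dataset: {mean(downloads):.1f}")
```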

Results

A total of 522 datasets were identified for analysis, covering the period between January 2007 and August 2016. This study revealed two influxes of deposits, one during 2008-2009 and another in 2014. During the first, a large number of PDFs were deposited by the Illinois Department of Agriculture.

During the second, Microsoft Excel files were deposited by the Rare Books and Manuscript Library. Single-file datasets clearly dominate the deposits in the campus repository. The total download count for all datasets was 139,663, and downloads per month per file across all datasets averaged 3.2.

Conclusion

Academic librarians, repository managers, and research data services staff can use the results presented here to anticipate the nature of research data that may be deposited within institutional repositories.

With increased awareness, content recruitment, and improvements, IRs can provide a viable cyberinfrastructure for researchers to deposit data, but much can be learned from the data already deposited.

Awareness of trends can help librarians facilitate discussions with researchers about research data deposits as well as better tailor their services to address short-term and long-term research needs.

URL : Assessing Research Data Deposits and Usage Statistics within IDEALS

DOI : https://doi.org/10.7191/jeslib.2017.1112

Documentation and Visualisation of Workflows for Effective Communication, Collaboration and Publication @ Source

Authors : Cerys Willoughby, Jeremy G. Frey

Workflows processing data from research activities and driving in silico experiments are becoming an increasingly important method for conducting scientific research. Workflows have the advantage that not only can they be automated and used to process data repeatedly, but they can also be reused – in part or whole – enabling them to be evolved for use in new experiments.

A number of studies have investigated strategies for storing and sharing workflows for the benefit of reuse. These have revealed that simply storing workflows in repositories without additional context does not enable workflows to be successfully reused.

These studies have investigated what additional resources are needed to support users of workflows, in particular by adding provenance traces and making workflows and their resources machine-readable.

These additions also include adding metadata for curation, annotations for comprehension, and including data sets to provide additional context to the workflow. Ultimately though, these mechanisms still rely on researchers having access to the software to view and run the workflows.

We argue that there are situations where researchers may want to understand a workflow in ways that go beyond what provenance traces provide, and without having to run the workflow directly; in many situations it can be difficult or impossible to run the original workflow.

To that end, we have investigated the creation of an interactive workflow visualisation that captures the flow-chart element of the workflow together with additional context, including annotations, descriptions, parameters, metadata, and input, intermediate, and results data, which can be added to the record of a workflow experiment to enhance curation and add value that enables reuse.

We have created interactive workflow visualisations for the popular workflow creation tool KNIME, which does not provide users with an in-built function to extract provenance information that can otherwise only be viewed through the tool itself.

Making use of the strengths of KNIME for adding documentation and user-defined metadata, we can extract and create a visualisation and curation package that encourages and enhances curation@source, facilitating effective communication, collaboration, and reuse of workflows.
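
As a very rough sketch of this kind of extraction, and under the assumption that a KNIME workflow is exported as a directory containing one settings.xml file per node (the exact XML keys vary between KNIME versions and are not guaranteed here), the Python snippet below walks such a directory and collects any node documentation it can find.

```python
# Walk a hypothetical exported KNIME workflow directory and list per-node
# documentation. The file layout and attribute keys are assumptions.
import xml.etree.ElementTree as ET
from pathlib import Path

workflow_dir = Path("MyWorkflow")  # hypothetical exported workflow

for settings in sorted(workflow_dir.rglob("settings.xml")):
    node = settings.parent.name
    notes = []
    for elem in ET.parse(settings).getroot().iter():
        key = elem.attrib.get("key", "")
        if "annotation" in key or key == "name":
            value = elem.attrib.get("value")
            if value:
                notes.append(f"{key}={value}")
    print(node, "->", "; ".join(notes) or "(no documentation found)")
```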

URL : Documentation and Visualisation of Workflows for Effective Communication, Collaboration and Publication @ Source

DOI : https://doi.org/10.2218/ijdc.v12i1.532

Research Transparency: A Preliminary Study of Disciplinary Conceptualisation, Drivers, Tools and Support Services

Authors : Liz Lyon, Wei Jeng, Eleanor Mattern

This paper describes a preliminary study of research transparency, which draws on the findings from four focus group sessions with faculty in chemistry, law, urban and social studies, and civil and environmental engineering.

The multi-faceted nature of transparency is highlighted by the broad ways in which the faculty conceptualised the concept (data sharing, ethics, replicability) and the vocabulary they used with common core terms identified (data, methods, full disclosure).

The associated concepts of reproducibility and trust are noted. The research lifecycle stages are used as a foundation to identify the action verbs and software tools associated with transparency.

A range of transparency drivers and motivations are listed. The role of libraries and data scientists is discussed in the context of the provision of transparency services for researchers.

URL : Research Transparency: A Preliminary Study of Disciplinary Conceptualisation, Drivers, Tools and Support Services

DOI : https://doi.org/10.2218/ijdc.v12i1.530

Amplifying Data Curation Efforts to Improve the Quality of Life Science Data

Authors : Mariam Alqasab, Suzanne M. Embury, Sandra de F. Mendes Sampaio

In the era of data science, datasets are shared widely and used for many purposes unforeseen by the original creators of the data. In this context, defects in datasets can have far reaching consequences, spreading from dataset to dataset, and affecting the consumers of data in ways that are hard to predict or quantify.

Some form of waste is often the result. For example, scientists using defective data to propose hypotheses for experimentation may waste their limited wet lab resources chasing the wrong experimental targets. Scarce drug trial resources may be used to test drugs that actually have little chance of providing a cure.

Because of the potential real-world costs, database owners care about providing high quality data. Automated curation tools can be used, to an extent, to discover and correct some forms of defect.

However, in some areas human curation, performed by highly-trained domain experts, is needed to ensure that the data represents our current interpretation of reality accurately.

Human curators are expensive, and there is far more curation work to be done than there are curators available to perform it. Tools and techniques are needed to enable the full value to be obtained from the curation effort currently available.

In this paper, we explore one possible approach to maximising the value obtained from human curators, by automatically extracting information about data defects and corrections from the work that the curators do.

This information is packaged in a source independent form, to allow it to be used by the owners of other databases (for which human curation effort is not available or is insufficient).

This amplifies the efforts of the human curators, allowing their work to be applied to other sources, without requiring any additional effort or change in their processes or tool sets. We show that this approach can discover significant numbers of defects, which can also be found in other sources.
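
The Python sketch below is a simplified illustration of this idea rather than the paper's method: it compares a record before and after human curation, packages each change as a source-independent correction rule, and applies the rules to another source's record when the same defect appears. The field names and values are invented for illustration.

```python
# Extract defect/correction rules from curated records and reapply them elsewhere.
def extract_corrections(before: dict, after: dict) -> list[dict]:
    """Return one (field, defective value, corrected value) rule per change."""
    return [
        {"field": field, "defect": before[field], "correction": after[field]}
        for field in before
        if field in after and before[field] != after[field]
    ]

def apply_corrections(record: dict, rules: list[dict]) -> dict:
    """Apply rules to another source's record when the same defect is present."""
    fixed = dict(record)
    for rule in rules:
        if fixed.get(rule["field"]) == rule["defect"]:
            fixed[rule["field"]] = rule["correction"]
    return fixed

rules = extract_corrections(
    {"organism": "homo sapeins", "source": "lab notebook"},   # before curation
    {"organism": "Homo sapiens", "source": "lab notebook"},   # after curation
)
print(apply_corrections({"organism": "homo sapeins", "strain": "K-12"}, rules))
```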

URL : Amplifying Data Curation Efforts to Improve the Quality of Life Science Data

DOI : https://doi.org/10.2218/ijdc.v12i1.495