DataMed – an open source discovery index for finding biomedical datasets

Authors : Xiaoling Chen, Anupama E Gururaj, Burak Ozyurt, Ruiling Liu, Ergin Soysal, Trevor Cohen, Firat Tiryaki, Yueling Li, Nansu Zong, Min Jiang, Deevakar Rogith, Mandana Salimi, Hyeon-eui Kim, Philippe Rocca-Serra, Alejandra Gonzalez-Beltran, Claudiu Farcas, Todd Johnson, Ron Margolis, George Alter, Susanna-Assunta Sansone, Ian M Fore, Lucila Ohno-Machado, Jeffrey S Grethe, Hua Xu


Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain.

Materials and Methods

DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health–funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium.

It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries.

In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine.

Results and Conclusion

Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and that core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the fraction of the top 10 search results that are relevant) of 0.6022, by implementing advanced natural language processing and terminology services.
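To make the P@10 figure concrete, here is a minimal sketch of how precision at k is commonly computed for a single query. It is not taken from the DataMed code base: the function name, the dataset identifiers, and the convention of dividing by k even when fewer than k results come back are all assumptions made for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked results that are judged relevant.

    Hypothetical helper for illustration; divides by k (not by the number
    of results returned), which is one common convention.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for dataset_id in top_k if dataset_id in relevant_ids)
    return hits / k


# Made-up ranking of ten dataset identifiers for one query.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
# Made-up relevance judgments; "d42" is relevant but was not retrieved.
judged_relevant = {"d1", "d3", "d5", "d42"}

score = precision_at_k(ranking, judged_relevant, k=10)
print(score)
```

A benchmark-level score such as the one reported above would then typically be the mean of this value over all benchmark queries.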

We have made the DataMed system publicly available as an open source package for the biomedical community.



Balancing the local and the universal in maintaining ethical access to a genomics biobank

Authors : Catherine Heeney, Shona M. Kerr


Issues of balancing data accessibility with ethical considerations and governance of a genomics research biobank, Generation Scotland, are explored within the evolving policy landscape of the past ten years. During this time data sharing and open data access have become increasingly important topics in biomedical research.

Decisions around data access are influenced both by the local, through governance arrangements and practices such as linkage to health records, and by the global, through policies for biobanking and for sharing data with large-scale biomedical research data resources and consortia.


We use a literature review of policy-relevant documents that apply to the conduct of biobanks in two areas: support for open access, and the protection of data subjects and of researchers managing a bioresource.

We present examples of decision making within a biobank based upon observations of the Generation Scotland Access Committee. We reflect upon how the drive towards open access raises ethical dilemmas for established biorepositories containing data and samples from human subjects.


Despite much discussion in science policy literature about standardisation, the contextual aspects of biobanking are often overlooked. Using our engagement with Generation Scotland (GS), we demonstrate the importance of local arrangements in the creation of a responsive ethical approach to biorepository governance.

We argue that governance decisions regarding access to the biobank are intertwined with considerations about maintenance and viability at the local level. We show that in addition to the focus upon ever more universal and standardised practices, the local expertise gained in the management of such repositories must be supported.


A commitment to open access in genomics research has found almost universal backing in science and health policy circles, but repositories of data and samples from human subjects may have to operate under managed access, to protect privacy, align with participant consent and ensure that the resource can be managed in a sustainable way.

Data access committees need to be reflexive and flexible, to cope with changing technology and with opportunities and threats from the wider data sharing environment. Understanding these interactions also involves nurturing what is particular about the biobank in its local context.

URL : Balancing the local and the universal in maintaining ethical access to a genomics biobank


Biotea: semantics for Pubmed Central

Authors : Alexander Garcia, Federico Lopez, Leyla Garcia, Olga Giraldo, Victor Bucheli, Michel Dumontier

A significant portion of biomedical literature is represented in a manner that makes it difficult for consumers to find or aggregate content through a computational query. One approach to facilitate reuse of the scientific literature is to structure this information as linked data using standardized web technologies.

In this paper we present the second version of Biotea, a semantic, linked data version of the open-access subset of PubMed Central that has been enhanced with specialized annotation pipelines that use existing infrastructure from the National Center for Biomedical Ontology.

We expose our models, services, software and datasets. Our infrastructure enables manual and semi-automatic annotation; the resulting data are represented as RDF-based linked data and can be readily queried using the SPARQL query language.

We illustrate the utility of our system with several use cases. Our datasets, methods and techniques are available at

URL : Biotea: semantics for Pubmed Central


Data Sharing and Cardiology : Platforms and Possibilities

Authors : Pranammya Dey, Joseph S. Ross, Jessica D. Ritchie, Nihar R. Desai, Sanjeev P. Bhavnani, Harlan M. Krumholz

Sharing deidentified patient-level research data presents immense opportunities to all stakeholders involved in cardiology research and practice. Sharing data encourages the use of existing data for knowledge generation to improve practice, while also allowing for validation of disseminated research.

In this review, we discuss key initiatives and platforms that have helped to accelerate progress toward greater sharing of data. These efforts are being prompted by government, universities, philanthropic sponsors of research, major industry players, and collaborations among some of these entities.

As data sharing becomes a more common expectation, policy changes will be required to encourage and assist data generators with the process of sharing the data they create.

Patients also will need access to their own data and to be empowered to share those data with researchers. Although medicine still lags behind other fields in achieving data sharing’s full potential, cardiology research has the potential to lead the way.



Understanding the Changing Roles of Scientific Publications via Citation Embeddings

Authors : Jiangen He, Chaomei Chen

Researchers may describe different aspects of past scientific publications in their own work, and these descriptions may keep changing as science evolves. The diverse and changing descriptions (i.e., citation contexts) of a publication characterize its impact and contributions.

In this article, we aim to provide an approach to understanding the changing and complex roles of a publication as characterized by its citation context. We describe a method that represents a publication's dynamic roles in the scientific community in different periods as a sequence of vectors, obtained by training temporal embedding models.

The temporal representations can be used to quantify how much the roles of publications changed and interpret how they changed.
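As a rough illustration of how a sequence of per-period vectors can be turned into a change score, the sketch below measures the cosine distance between a publication's vectors in consecutive periods. The abstract does not specify the authors' metric, so cosine distance, the function names, and the toy vectors are assumptions; the sketch also assumes the vectors from different periods already live in a shared, comparable space.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def role_changes(period_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Distance between each pair of consecutive period vectors (hypothetical helper)."""
    return [
        cosine_distance(earlier, later)
        for earlier, later in zip(period_vectors, period_vectors[1:])
    ]


# Toy 2-dimensional vectors for one publication across three periods.
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
changes = role_changes(toy_vectors)
print(changes)
```

In this toy case the first score is zero because the vector does not move between the first two periods, and the second is larger because the vector rotates to an orthogonal direction.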

Our study in the biomedical domain shows that our metric of changes in publications' roles is stable over time at the population level but significantly distinguishes individual publications. We also show the interpretability of our method with a concrete example.


A review of data sharing statements in observational studies published in the BMJ: A cross-sectional study

Authors : Laura McDonald, Anna Schultze, Alex Simpson, Sophie Graham, Radek Wasiak, Sreeram V. Ramagopalan

In order to understand the current state of data sharing in observational research studies, we reviewed data sharing statements of observational studies published in a general medical journal, the British Medical Journal.

We found that the majority (63%) of observational studies published between 2015 and 2017 included a statement that implied that data used in the study could not be shared. If the findings of our exploratory study are confirmed, room for improvement in the sharing of real-world or observational research data exists.

URL : A review of data sharing statements in observational studies published in the BMJ: A cross-sectional study


Publications en libre accès en biologie–médecine : historique et état des lieux en 2016 (Open access publications in biology and medicine: history and state of play in 2016)

Authors : Christophe Boudry, Manuel Durand-Barthez

The emergence of the open access (OA) movement and of open repositories has upended, and continues to upend, the economics of scientific publications and access to them. The aim of this article is to update and extend the results of earlier studies that attempted to quantify the importance of OA in biology and medicine, by focusing on the PubMed bibliographic database.

An analysis of OA publications in PubMed by the geographic origin of the authors (countries and continents) was also carried out, and a number of OA-related parameters (growth in the number of OA journals, number of mandates and open repositories by country and continent) were examined and put into perspective. The results show that the percentage of articles whose full text is available in OA keeps rising and reached 39.1% of the articles available in PubMed in 2015.

The geographic analysis of the 25 most productive countries and of the continents shows wide variability in the percentage of OA articles (from 21.9% for Italy to 42.08% for the United States, and from 22.80% for Oceania to 40.84% for North America).

Moreover, our data show that the number of mandates and open repositories is not significantly correlated with the percentage of OA articles at the national and continental levels, confirming that successive public policies and OA mandates have had an influence that was, if not secondary, at least below expectations.

Introducing more coercive mandates may achieve more significant effects in the medium to long term. The steady increase in the number of OA journals, together with the demonstrated increase in citations of OA articles, will certainly further strengthen authors' interest in OA.