Semantic representation and enrichment of information retrieval experimental data

Authors : Gianmaria Silvello, Georgeta Bordea, Nicola Ferro, Paul Buitelaar, Toine Bogers

Experimental evaluation carried out in international large-scale campaigns is a fundamental pillar of the scientific and technological advancement of information retrieval (IR) systems.

Such evaluation activities produce a large quantity of scientific and experimental data, which are the foundation for all the subsequent scientific production and development of new systems.

In this work, we discuss how to semantically annotate and interlink this data, with the goal of enhancing their interpretation, sharing, and reuse. We discuss the underlying evaluation workflow and propose a resource description framework model for those workflow parts.

We use expertise retrieval as a case study to demonstrate the benefits of our semantic representation approach. We employ this model as a means for exposing experimental data as linked open data (LOD) on the Web and as a basis for enriching and automatically connecting this data with expertise topics and expert profiles.

In this context, a topic-centric approach for expert search is proposed, addressing the extraction of expertise topics, their semantic grounding with the LOD cloud, and their connection to IR experimental data.

Several methods for expert profiling and expert finding are analysed and evaluated. Our results show that it is possible to construct expert profiles starting from automatically extracted expertise topics and that topic-centric approaches outperform state-of-the-art language modelling approaches for expert finding.

URL : https://aran.library.nuigalway.ie/handle/10379/5862

Experiences in integrated data and research object publishing using GigaDB

Authors : Scott C Edmunds, Peter Li, Christopher I Hunter, Si Zhe Xiao, Robert L Davidson, Nicole Nogoy, Laurie Goodman

In the era of computation and data-driven research, traditional methods of disseminating research are no longer fit-for-purpose. New approaches for disseminating data, methods and results are required to maximize knowledge discovery.

The “long tail” of small, unstructured datasets is well catered for by a number of general-purpose repositories, but there has been less support for “big data”. Outlined here are our experiences in attempting to tackle the gaps in publishing large-scale, computationally intensive research.

GigaScience is an open-access, open-data journal aiming to revolutionize large-scale biological data dissemination, organization and re-use. Through use of the data handling infrastructure of the genomics centre BGI, GigaScience links standard manuscript publication with an integrated database (GigaDB) that hosts all associated data, and provides additional data analysis tools and computing resources.

Furthermore, the supporting workflows and methods are also integrated to make published articles more transparent and open. GigaDB has released many new and previously unpublished datasets and data types, including as urgently needed data to tackle infectious disease outbreaks, cancer and the growing food crisis.

Other “executable” research objects, such as workflows, virtual machines and software from several GigaScience articles have been archived and shared in reproducible, transparent and usable formats.

With data citation producing evidence of, and credit for, its use in the wider research community, GigaScience demonstrates a move towards more executable publications. Here data analyses can be reproduced and built upon by users without coding backgrounds or heavy computational infrastructure in a more democratized manner.

URL : Experiences in integrated data and research object publishing using GigaDB

DOI : http://link.springer.com/article/10.1007/s00799-016-0174-6

Advancing research data publishing practices for the social sciences: from archive activity to empowering researchers

Authors : Veerle Van den Eynden, Louise Corti

Sharing and publishing social science research data have a long history in the UK, through long-standing agreements with government agencies for sharing survey data and the data policy, infrastructure, and data services supported by the Economic and Social Research Council.

The UK Data Service and its predecessors developed data management, documentation, and publishing procedures and protocols that stand today as robust templates for data publishing.

As the ESRC research data policy requires grant holders to submit their research data to the UK Data Service after a grant ends, setting standards and promoting them has been essential in raising the quality of the resulting research data being published. In the past, received data were all processed, documented, and published for reuse in-house.

Recent investments have focused on guiding and training researchers in good data management practices and skills for creating shareable data, as well as a self-publishing repository system, ReShare. ReShare also receives data sets described in published data papers and achieves scientific quality assurance through peer review of submitted data sets before publication.

Social science data are reused for research, to inform policy, in teaching and for methods learning. Over a 10 years period, responsive developments in system workflows, access control options, persistent identifiers, templates, and checks, together with targeted guidance for researchers, have helped raise the standard of self-publishing social science data.

Lessons learned and developments in shifting publishing social science data from an archivist responsibility to a researcher process are showcased, as inspiration for institutions setting up a data repository.

URL : Advancing research data publishing practices for the social sciences: from archive activity to empowering researchers

DOI : doi:10.1007/s00799-016-0177-3

TrueReview: A Platform for Post-Publication Peer Review

Authors : Luca de Alfaro, Marco Faella

In post-publication peer review, scientific contributions are first published in open-access forums, such as arXiv or other digital libraries, and are subsequently reviewed and possibly ranked and/or evaluated.

Compared to the classical process of scientific publishing, in which review precedes publication, post-publication peer review leads to faster dissemination of ideas, and publicly-available reviews. The chief concern in post-publication reviewing consists in eliciting high-quality, insightful reviews from participants.

We describe the mathematical foundations and structure of TrueReview, an open-source tool we propose to build in support of post-publication review.

In TrueReview, the motivation to review is provided via an incentive system that promotes reviews and evaluations that are both truthful (they turn out to be correct in the long run) and informative (they provide significant new information).

TrueReview organizes papers in venues, allowing different scientific communities to set their own submission and review policies. These venues can be manually set-up, or they can correspond to categories in well-known repositories such as arXiv.

The review incentives can be used to form a reviewer ranking that can be prominently displayed alongside papers in the various disciplines, thus offering a concrete benefit to reviewers. The paper evaluations, in turn, reward the authors of the most significant papers, both via an explicit paper ranking, and via increased visibility in search.

URL : https://arxiv.org/abs/1608.07878

 

Understanding the Impact of Early Citers on Long-Term Scientific Impact

Authors : Mayank Singh, Ajay Jaiswal, Priya Shree, Arindam Pal, Animesh Mukherjee, Pawan Goyal

This paper explores an interesting new dimension to the challenging problem of predicting long-term scientific impact (LTSI) usually measured by the number of citations accumulated by a paper in the long-term.

It is well known that early citations (within 1-2 years after publication) acquired by a paper positively affects its LTSI. However, there is no work that investigates if the set of authors who bring in these early citations to a paper also affect its LTSI.

In this paper, we demonstrate for the first time, the impact of these authors whom we call early citers (EC) on the LTSI of a paper. Note that this study of the complex dynamics of EC introduces a brand new paradigm in citation behavior analysis.

Using a massive computer science bibliographic dataset we identify two distinct categories of EC – we call those authors who have high overall publication/citation count in the dataset as influential and the rest of the authors as non-influential.

We investigate three characteristic properties of EC and present an extensive analysis of how each category correlates with LTSI in terms of these properties. In contrast to popular perception, we find that influential EC negatively affects LTSI possibly owing to attention stealing.

To motivate this, we present several representative examples from the dataset. A closer inspection of the collaboration network reveals that this stealing effect is more profound if an EC is nearer to the authors of the paper being investigated.

As an intuitive use case, we show that incorporating EC properties in the state-of-the-art supervised citation prediction models leads to high performance margins.

At the closing, we present an online portal to visualize EC statistics along with the prediction results for a given query paper.

URL : https://arxiv.org/abs/1705.03178

Open data, [open] access: linking data sharing and article sharing in the Earth Sciences

Author : Samantha Teplitzky

Introduction

The norms of a research community influence practice, and norms of openness and sharing can be shaped to encourage researchers who share in one aspect of their research cycle to share in another.

Different sets of mandates have evolved to require that research data be made public, but not necessarily articles resulting from that collected data. In this paper, I ask to what extent publications in the Earth Sciences are more likely to be open access (in all of its definitions) when researchers open their data through the Pangaea repository.

Methods

Citations from Pangaea data sets were studied to determine the level of open access for each article.

Results

This study finds that the proportion of gold open access articles linked to the repository increased 25% from 2010 to 2015 and 75% of articles were available from multiple open sources.

Discussion

The context for increased preference for gold open access is considered and future work linking researchers’ decisions to open their work to the adoption of open access mandates is proposed.

URL : Open data, [open] access: linking data sharing and article sharing in the Earth Sciences

DOI : http://doi.org/10.7710/2162-3309.2150