Citations are not opinions: a corpus linguistics approach to understanding how citations are made

Author : Domenic Rosati

Citation content analysis seeks to understand citations based on the language used during the making of a citation. A key issue in citation content analysis is looking for linguistic structures that characterize distinct classes of citations for the purposes of understanding the intent and function of a citation.

Previous works have focused on modeling linguistic features first and drawn conclusions on the language structures unique to each class of citation function based on the performance of a classification task or inter-annotator agreement.

In this study, we start with a large sample of a pre-classified citation corpus, 2 million citations from each class of the scite Smart Citation dataset (supporting, disputing, and mentioning citations), and analyze its corpus linguistics in order to reveal the unique and statistically significant language structures belonging to each type of citation.

By generating comparison tables for each citation type we present a number of interesting linguistic features that uniquely characterize citation type. What we find is that within citation collocates, there is very low correlation between citation type and sentiment.

Additionally, we find that the subjectivity of citation collocates across classes is very low. These findings suggest that the sentiment of collocates is not a predictor of citation function and that due to their low subjectivity, an opinion-expressing mode of understanding citations, implicit in previous citation sentiment analysis literature, is inappropriate.

Instead, we suggest that citations can be better understood as claims-making devices where the citation type can be explained by understanding how two claims are being compared. By presenting this approach, we hope to inspire similar corpus linguistic studies on citations that derive a more robust theory of citation from an empirical basis using citation corpora.