Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

Authors : Harshdeep Singh, Robert West, Giovanni Colavizza

Wikipedia’s contents are based on reliable and published sources. To this date, little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive dataset of citations extracted from Wikipedia.

A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020, and classified as being to books, journal articles or Web contents. We were thus able to extract 4.0M citations to scholarly publications with known identifiers — including DOI, PMC, PMID, and ISBN — and further labeled an extra 261K citations with DOIs from Crossref.

As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI. Scientific articles cited from Wikipedia correspond to 3.5% of all articles with a DOI currently indexed in the Web of Science. We release all our code to allow the community to extend upon our work and update the dataset in the future.

URL : https://arxiv.org/abs/2007.07022

Why We Read Wikipedia

Authors : Philipp Singer, Florian Lemmerich, Robert West, Leila Zia, Ellery Wulczyn, Markus Strohmaier, Jure Leskovec

Wikipedia is one of the most popular sites on the Web, with millions of users relying on it to satisfy a broad range of information needs every day. Although it is crucial to understand what exactly these needs are in order to be able to meet them, little is currently known about why users visit Wikipedia.

The goal of this paper is to fill this gap by combining a survey of Wikipedia readers with a log-based analysis of user activity.

Based on an initial series of user surveys, we build a taxonomy of Wikipedia use cases along several dimensions, capturing users’ motivations to visit Wikipedia, the depth of knowledge they are seeking, and their knowledge of the topic of interest prior to visiting Wikipedia.

Then, we quantify the prevalence of these use cases via a large-scale user survey conducted on live Wikipedia with almost 30,000 responses.

Our analyses highlight the variety of factors driving users to Wikipedia, such as current events, media coverage of a topic, personal curiosity, work or school assignments, or boredom.

Finally, we match survey responses to the respondents’ digital traces in Wikipedia’s server logs, enabling the discovery of behavioral patterns associated with specific use cases.

For instance, we observe long and fast-paced page sequences across topics for users who are bored or exploring randomly, whereas those using Wikipedia for work or school spend more time on individual articles focused on topics such as science.

Our findings advance our understanding of reader motivations and behavior on Wikipedia and can have implications for developers aiming to improve Wikipedia’s user experience, editors striving to cater to their readers’ needs, third-party services (such as search engines) providing access to Wikipedia content, and researchers aiming to build tools such as recommendation engines.

URL : Why We Read Wikipedia

Alternative location : https://arxiv.org/abs/1702.05379