Wikipedia Text Reuse: Within and Without

Authors : Milad Alshomary, Michael Völske, Tristan Licht, Henning Wachsmuth, Benno Stein, Matthias Hagen, Martin Potthast

We study text reuse related to Wikipedia at scale by compiling the first corpus of text reuse cases within Wikipedia as well as without (i.e., reuse of Wikipedia text in a sample of the Common Crawl).

To discover reuse beyond verbatim copy and paste, we employ state-of-the-art text reuse detection technology, scaling it for the first time to process the entire Wikipedia as part of a distributed retrieval pipeline.

We further report on a pilot analysis of the 100 million reuse cases inside, and the 1.6 million reuse cases outside Wikipedia that we discovered. Text reuse inside Wikipedia gives rise to new tasks such as article template induction, fixing quality flaws due to inconsistencies arising from asynchronous editing of reused passages, or complementing Wikipedia’s ontology.

Text reuse outside Wikipedia yields a tangible metric for the emerging field of quantifying Wikipedia’s influence on the web. To foster future research into these tasks, and for reproducibility’s sake, the Wikipedia text reuse corpus and the retrieval pipeline are made freely available.


Wikipedia: an opportunity to rethink the links between sources’ credibility, trust and authority

Authors : Gilles Sahut, André Tricot

The Web and its main tools (Google, Wikipedia, Facebook, Twitter) raise and renew fundamental questions that everyone asks almost every day: Is this information or content true? Can I trust this author or source?

These questions are not new: they arose in the same way with books, newspapers, radio and television and, more fundamentally, in all interpersonal communication.

This paper focuses on two scientific problems related to this issue. The first is theoretical: to address the issue, many concepts have been used in library and information science, communication and psychology.

The links between these concepts are not clear: two concepts are sometimes treated as synonymous and sometimes as very different. The second problem is historical: sources like Wikipedia deeply challenge the epistemic evaluation of information sources, compared to previous modes of information production.

This paper proposes a simple, integrated model that treats the relation between a user, a document and an author as human communication. It reduces the problem to three concepts: credibility as a characteristic granted to information depending on its truth-value; trust as the ability to produce credible information; and authority as an author's power to influence when that power is accepted, i.e., when readers accept that the source can modify their opinions, knowledge and decisions.

The model also describes two kinds of relationships between the three concepts: an upward link and a downward link. The model is tested against empirical research findings, on Wikipedia in particular.


The Evolution of the Concept of Semantic Web in the Context of Wikipedia: An Exploratory Approach to Study the Collective Conceptualization in a Digital Collaborative Environment

Authors : Luís Miguel Machado, Maria Manuel Borges, Renato Rocha Souza

Wikipedia, as a “social machine”, is a privileged place to observe the collective construction of concepts without central control. Based on Dahlberg’s theory of concept, and anchored in the pragmatism of Hjørland, in which concepts are socially negotiated meanings, the evolution of the concept of semantic web (SW) was analyzed in the English version of Wikipedia.

An exploratory, descriptive, and qualitative study was designed, and we identified 26 different definitions (between 12 July 2001 and 31 December 2017), of which eight are of particular relevance for their duration; the last two of these are the ones recorded at the end of the analyzed period.

According to them, SW: “is an extension of the web” and “is a Web of Data”; the latter, used as a complementary definition, links to Berners-Lee’s publications. In Wikipedia, the evolution of the SW concept appears to be driven by the search for non-technical vocabulary and by the authority control exercised through debate.

Because Wikipedia is a space for the collective negotiation of meanings, studying it may contribute to understanding how a community conceives a particular concept and how that concept evolves over time.



La gouvernance de Wikipédia : élaboration de règles et théorie d’Ostrom (The governance of Wikipedia: rule-making and Ostrom’s theory)

Author : Gilles Sahut

Wikipedia’s success is frequently attributed to the soundness of its governance. However, there is no scientific consensus on how to characterize that governance.

In this empirical study, we examine one facet of this governance within the French-language Wikipedia: how two rules related to the citation of sources were constructed.

These rules are studied through Ostrom’s theory of the commons. We show that they are discussed and written by a minority of particularly involved contributors. Thus, Wikipedia has no “political class” cut off from the field.

We also highlight the influence on this process of the internal communication system and of the English-language Wikipedia.


Science Is Shaped by Wikipedia: Evidence From a Randomized Control Trial

Authors : Neil Thompson, Douglas Hanley

“I sometimes think that general and popular treatises are almost as important for the progress of science as original work.” – Charles Darwin, 1865.

It is not surprising that Wikipedia, as the largest encyclopedia in the world, reflects the state of scientific knowledge. However, Wikipedia is also one of the most accessed websites in the world, including by scientists, which suggests that it also has the potential to shape science. This paper shows that it does.

Incorporating ideas into Wikipedia leads to those ideas being used more in the scientific literature. We provide correlational evidence of this across thousands of Wikipedia articles and causal evidence of it through a randomized control trial where we add new scientific content to Wikipedia.

We find that the causal impact is strong, with Wikipedia influencing roughly one in every 830 words in related scientific journal articles. We also find causal evidence that the scientific articles referenced in Wikipedia receive more citations, suggesting that Wikipedia complements the traditional journal system by pointing researchers to key underlying scientific articles.

Our findings speak not only to the influence of Wikipedia, but more broadly to the influence of repositories of scientific knowledge and the role that they play in the creation of scientific knowledge.


Wikipedia in higher education: Changes in perceived value through content contribution

Authors : Joan Soler-Adillon, Dragana Pavlovic

Wikipedia is widely used by university students, but they do not necessarily regard it as reliable and trustworthy, nor do they see it as a place in which to contribute content.

This paper presents a teaching and research project that consisted of having students edit or create Wikipedia articles and testing whether this experience changed their perceived value of the platform. We conducted the project at Universitat Pompeu Fabra (Barcelona, Spain) and University of Niš (Niš, Serbia) with a total of 240 students.

These students edited articles and answered two questionnaires, one before and one after the exercise. We compared the pre- and post-experience answers to the questionnaires with a series of paired-samples t-tests, which showed that students did significantly change their perception of Wikipedia's reliability and usefulness, and of the likelihood of finding false information on it.
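For readers unfamiliar with the method, the following is a minimal sketch of how a paired-samples t statistic is computed. The function name `paired_t_statistic` and the ratings are hypothetical illustrations, not the authors' code or data; the sketch only shows the arithmetic of the test, and a p-value would still have to be looked up from the t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for a paired-samples t-test.

    Each position in `before` and `after` must belong to the same respondent.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    if len(before) < 2:
        raise ValueError("at least two pairs are required")

    # The test is run on the per-respondent differences.
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")

    # t = mean difference divided by its standard error.
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical 1-5 ratings of perceived reliability from six respondents.
pre = [3, 2, 4, 3, 2, 3]
post = [4, 3, 4, 5, 3, 4]

t_stat, dof = paired_t_statistic(pre, post)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

A positive t here means ratings rose after the exercise; whether the change is significant depends on comparing t against the critical value for the given degrees of freedom.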

Their appreciation of the task of writing Wikipedia articles, in terms of it being interesting and challenging, also increased. They did not significantly change, however, their judgement of the social value of the platform, either in the university or in the general context.

In addition, the open questions and informal feedback allowed us to gather valuable insights for evaluating the overall experience.




Evolution of Wikipedia’s medical content: past, present and future

Authors : Thomas Shafee, Gwinyai Masukume, Lisa Kipersztok, Diptanshu Das, Mikael Häggström, James Heilman

As one of the most commonly read online sources of medical information, Wikipedia is an influential public health platform. Its medical content, community, collaborations and challenges have been evolving since its creation in 2001, and engagement by the medical community is vital for ensuring its accuracy and completeness.

Both the encyclopaedia’s internal metrics and external assessments of its quality indicate that its articles are highly variable, but improving. Although content can be edited by anyone, medical articles are primarily written by a core group of medical professionals.

Diverse collaborative ventures have enhanced medical article quality and reach, and opportunities for partnerships are more available than ever. Nevertheless, Wikipedia’s medical content and community still face significant challenges, and a socioecological model is used to structure specific recommendations.

We propose that the medical community should prioritise the accuracy of biomedical information in the world’s most consulted encyclopaedia.
