Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

Authors : Harshdeep Singh, Robert West, Giovanni Colavizza

Wikipedia’s contents are based on reliable and published sources. To date, little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive dataset of citations extracted from Wikipedia.

A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020, and classified as referring to books, journal articles, or Web content. We were thus able to extract 4.0M citations to scholarly publications with known identifiers (including DOI, PMC, PMID, and ISBN) and further labeled an extra 261K citations with DOIs from Crossref.
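To make the idea of "citations with known identifiers" concrete, here is a minimal sketch of pulling identifiers out of a single citation template. It is not the authors' pipeline: the function name `extract_identifiers`, the regular expression, and the sample template are all assumptions for illustration, and real citation templates are far more varied than this handles.

```python
import re

# Assumed parameter names; actual Wikipedia citation templates vary widely.
ID_PATTERN = re.compile(r"\|\s*(doi|pmid|pmc|isbn)\s*=\s*([^|}]+)", re.IGNORECASE)


def extract_identifiers(citation: str) -> dict[str, str]:
    """Return the identifier fields found in one citation template string."""
    return {key.lower(): value.strip() for key, value in ID_PATTERN.findall(citation)}


# Hypothetical citation template, not taken from the dataset.
example = "{{cite journal |title=An example |doi=10.1000/xyz123 |pmid=12345678 }}"
ids = extract_identifiers(example)
print(ids)
```

A citation that yields at least one of these fields would count, under this simplified view, as a citation with a known identifier.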

As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI. Scientific articles cited from Wikipedia correspond to 3.5% of all articles with a DOI currently indexed in the Web of Science. We release all our code to allow the community to build upon our work and update the dataset in the future.

URL : https://arxiv.org/abs/2007.07022

Wikimedia and universities: contributing to the global commons in the Age of Disinformation

Authors : Martin Poulter, Nick Sheppard

In its first 30 years the world wide web has revolutionized the information environment. However, its impact has been negative as well as positive, through corporate misuse of personal data and due to its potential for enabling the spread of disinformation.

As a large-scale collaborative platform funded through charitable donations, with a mission to provide universal free access to knowledge as a public good, Wikipedia is one of the most popular websites in the world.

This paper explores the role of Wikipedia in the information ecosystem, where it occupies a unique position as a bridge between informal discussion and scholarly publication.

We explore how it relates to the broader Wikimedia ecosystem, through structured data on Wikidata for instance, and openly licensed media on Wikimedia Commons.

We consider the potential benefits for universities in the areas of information literacy and research impact, and investigate the extent to which universities in the UK and their libraries are engaging strategically with Wikimedia, if at all.

URL : Wikimedia and universities: contributing to the global commons in the Age of Disinformation

DOI : Wikimedia and universities: contributing to the global commons in the Age of Disinformation

Contribuer à la diffusion du patrimoine documentaire sur Wikipédia : pratiques et enjeux pour les institutions culturelles

Auteurs/Authors : Jessica de Bideran, Romain Wenz

For heritage institutions, taking part in open initiatives such as Wikipedia represents a paradigm shift. Drawing on several years of teaching experience in connection with cultural organizations that hold documentary heritage, the article confronts contemporary theoretical thinking on the dissemination of knowledge with the reality of administrative institutions.

The challenge for those involved is to reconcile the needs and principles of a general, worldwide encyclopedia with the expectations of institutions historically centered on presenting collections to physical visitors.

However, wanting to be present on Wikipedia requires the cultural institution to rethink its stance toward audiences that are virtual yet no less collectively organized and engaged.

DOI : https://doi.org/10.4000/culturemusees.4762

Science through Wikipedia: A novel representation of open knowledge through co-citation networks

Authors : Wenceslao Arroyo-Machado, Daniel Torres-Salinas, Enrique Herrera-Viedma, Esteban Romero-Frías

This study provides an overview of science from the Wikipedia perspective. A methodology has been established for the analysis of how Wikipedia editors regard science through their references to scientific papers.

The method of co-citation has been adapted to this context in order to generate Pathfinder networks (PFNET) that highlight the most relevant scientific journals and categories and their interactions, in order to find out how scientific literature is consumed through this open encyclopaedia.
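As a rough illustration of co-citation in this setting, the sketch below counts how often two journals are referenced by the same Wikipedia article. The function name `cocitation_counts` and the toy data are assumptions, and the Pathfinder pruning step that produces a PFNET is not implemented here; this only builds the raw co-citation counts such a network would start from.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(article_refs: dict[str, list[str]]) -> Counter:
    """Count, for each pair of journals, the articles that cite both."""
    counts: Counter = Counter()
    for journals in article_refs.values():
        # De-duplicate within an article and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(journals)), 2):
            counts[pair] += 1
    return counts


# Toy data: Wikipedia article -> journals appearing in its references (hypothetical).
toy = {
    "Article A": ["Nature", "Science", "Cell"],
    "Article B": ["Nature", "Science"],
    "Article C": ["Cell"],
}
counts = cocitation_counts(toy)
print(counts.most_common())
```

In the toy data, Nature and Science are co-cited by two articles, so that pair carries the strongest link.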

In addition, the obsolescence of the cited literature has been studied through the Price index. A total of 1 433 457 references were initially taken into account. After pre-processing and linking them to the data from Elsevier’s CiteScore Metrics, the sample was reduced to 847 512 references made by 193 802 Wikipedia articles to 598 746 scientific articles belonging to 14 149 journals indexed in Scopus.
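The Price index is commonly defined as the percentage of references that are no more than five years old at the time of citing. The sketch below follows that common definition; the function name `price_index`, the choice of citing year, and the sample years are assumptions, and the study may compute reference age differently for Wikipedia articles.

```python
def price_index(reference_years: list[int], citing_year: int, window: int = 5) -> float:
    """Percentage of references published within `window` years of `citing_year`."""
    if not reference_years:
        raise ValueError("at least one reference year is required")
    recent = sum(1 for year in reference_years if 0 <= citing_year - year <= window)
    return 100.0 * recent / len(reference_years)


# Hypothetical publication years of the papers cited by one Wikipedia article.
value = price_index([2019, 2017, 2010, 2001], citing_year=2020)
print(value)  # 50.0
```

A higher value indicates that the cited literature is more recent, that is, less obsolete.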

Among the main results, we found a significant presence of “Medicine” and “Biochemistry, Genetics and Molecular Biology” papers, and that the most important journals are multidisciplinary in nature, which also suggests that high-impact-factor journals were more likely to be cited. Furthermore, only 13.44% of Wikipedia citations are to Open Access journals.

URL : https://arxiv.org/abs/2002.04347

Les désaccords éditoriaux dans Wikipédia comme tensions entre régimes épistémiques

Auteurs/Authors : Guillaume Carbou, Gilles Sahut

Despite its elaborate normative architecture, Wikipedia is the site of recurring disagreements among contributors.

Based on an argumentative analysis of a corpus of talk pages from articles that provoke heated debate (GMOs, September 11, etc.), the authors show that these disagreements are partly underpinned by the existence of competing “epistemic regimes” on Wikipedia.

These epistemic regimes (encyclopedist, scientific, scientistic, wiki, critical, and doxic) correspond to as many divergent conceptions of what is “valid” and of the ways to arrive at it.

URL : https://journals.openedition.org/communication/10788

Wikipedia Text Reuse: Within and Without

Authors : Milad Alshomary, Michael Völske, Tristan Licht, Henning Wachsmuth, Benno Stein, Matthias Hagen, Martin Potthast

We study text reuse related to Wikipedia at scale by compiling the first corpus of text reuse cases within Wikipedia as well as without (i.e., reuse of Wikipedia text in a sample of the Common Crawl).

To discover reuse beyond verbatim copy and paste, we employ state-of-the-art text reuse detection technology, scaling it for the first time to process the entire Wikipedia as part of a distributed retrieval pipeline.

We further report on a pilot analysis of the 100 million reuse cases inside, and the 1.6 million reuse cases outside Wikipedia that we discovered. Text reuse inside Wikipedia gives rise to new tasks such as article template induction, fixing quality flaws due to inconsistencies arising from asynchronous editing of reused passages, or complementing Wikipedia’s ontology.

Text reuse outside Wikipedia yields a tangible metric for the emerging field of quantifying Wikipedia’s influence on the web. To foster future research into these tasks, and for reproducibility’s sake, the Wikipedia text reuse corpus and the retrieval pipeline are made freely available.

URL : https://arxiv.org/abs/1812.09221

Wikipedia: an opportunity to rethink the links between sources’ credibility, trust and authority

Authors : Gilles Sahut, André Tricot

The Web and its main tools (Google, Wikipedia, Facebook, Twitter) raise and profoundly renew fundamental questions that everyone asks almost every day: Is this information or content true? Can I trust this author or source?

These questions are not new; they have arisen with books, newspapers, broadcasting and television, and, more fundamentally, in every human interpersonal communication.

This paper focuses on two scientific problems related to this issue. The first one is theoretical: to address this issue, many concepts have been used in library and information sciences, communication and psychology.

The links between these concepts are not clear: sometimes two concepts are considered as synonymous, sometimes as very different. The second one is historical: sources like Wikipedia deeply challenge the epistemic evaluation of information sources, compared to previous modes of information production.

This paper proposes an integrated and simple model considering the relation between a user, a document and an author as human communication. It reduces the problem to three concepts: credibility as a characteristic granted to information depending on its truth-value; trust as the ability to produce credible information; authority when the power to influence of an author is accepted, i.e., when readers accept that the source can modify their opinion, knowledge and decisions.

The model also describes two kinds of relationships between the three concepts: an upward link and a downward link. The model is then tested against findings from empirical research, on Wikipedia in particular.

URL : https://firstmonday.org/ojs/index.php/fm/article/view/7108/6555