Wikipedia Revision History
Name: Wikipedia Revision History
Authors: Harvard University/The Wikimedia Foundation
Brief Summary: Wikipedia retains the complete history of its articles'
revisions in order to assist collaborators and moderators. This data is used
to answer questions such as: who made a revision? What was changed? When and
where was the change made, and why? When viewed at the granularity of individual
pages (Wikipedia's native format) this data is linear, with each instance of
the page descending from its singular predecessor; but at coarser or finer
granularities the network is much more complex. At coarser granularities, there
are many agents (users) that operate in different ways on different pages to
produce Wikipedia as a whole. At finer granularities, these agents operate on
paragraphs, sentences, and words in an iterative, collaborative manner that
makes answering a question such as "Where did this sentence come from?"
interesting and challenging.
It seems perfectly reasonable to wish to incorporate Wikipedia revision
histories into a provenance system, but doing so requires translating the
existing data into a format such as the Open Provenance Model (OPM) and making
many representational decisions along the way. It is important to evaluate how
easy it is for different parties to incorporate non-OPM data into an OPM
framework, and whether the results of such parallel ingests can interoperate.
Scenario Diagram: Wikipedia's revision history can be represented in many ways.
We've produced a number of graphs in OPM format at a finer granularity than
Wikipedia's standard representation. In our graphs, nodes are revisions, and
ancestry is defined as follows: a revision depends on the body of text that it
modifies. Attached, you'll find a graph of Wikipedia's "Barack Obama" article,
revisions #279136198 through #294850556. Note its principled structure. Of
course, this is only one possible translation of Wikipedia into OPM.
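As a rough illustration of this kind of translation (a sketch, not the tooling
behind the attached graphs), the following Python fragment builds a graph in
which each node is a revision ID and each edge records that a revision was
derived from the revision it modified. The class name ProvenanceGraph and the
intermediate revision ID are made up for the example.

# Minimal sketch: each node is a revision, each edge records that a revision
# was derived from the revision whose text it modified.
from collections import defaultdict

class ProvenanceGraph:
    def __init__(self):
        self.nodes = set()                    # revision IDs (OPM artifacts)
        self.derived_from = defaultdict(set)  # child revision -> parent revisions

    def add_revision(self, rev_id, parent_id=None):
        """Register a revision; the edge to parent_id models "was derived from"."""
        self.nodes.add(rev_id)
        if parent_id is not None:
            self.nodes.add(parent_id)
            self.derived_from[rev_id].add(parent_id)

    def ancestors(self, rev_id):
        """All revisions that rev_id transitively depends on."""
        seen, stack = set(), [rev_id]
        while stack:
            for parent in self.derived_from[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

graph = ProvenanceGraph()
graph.add_revision(279136198)                       # first revision in the range above
graph.add_revision(279150000, parent_id=279136198)  # hypothetical intermediate revision
graph.add_revision(294850556, parent_id=279150000)  # last revision in the range above
print(graph.ancestors(294850556))                   # -> {279136198, 279150000}

In a full translation, each revision node would also be connected to the
editing agent (the user) and to the edit process itself, but the sketch above
captures only the artifact-to-artifact ancestry described here.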
Users: Editors, moderators, and researchers of Wikipedia.
Requirements for Provenance: Active editors of an article need to understand
what changes are being made to that article, and why, in order to contribute.
Moderators need to know what and why, as well as who and when, in order to
attribute revisions and reward and punish editors. Researchers have diverse
needs; for example, social network researchers need to understand a
coarse-granularity view of Wikipedia that focuses on its users and the ways in
which they interact.
Provenance Questions: Some examples, in order of increasing complexity (a
sketch of how the simpler ones map onto revision metadata follows this list):
- "How has this article been edited recently?"
- "Who produced this vandalism?"
- "Where did this sentence come from?"
- "Who are the authors of this paragraph, and in what capacities?"
- "What sorts of revisions does editor X make? Which ones lead to collaborative activity, and which ones are reverted? Is editor X a habitual vandal?"
- "Which editors revise each others' work? How have these editors affected each other? What is the social network of editors as defined by their history of interactions?"
Technologies Used: The fundamental data of course comes from Wikipedia: freely
available, queryable, and large. Many users work with this data in its native
format, and in addition many research groups independently ingest it into their
own systems for their own purposes.
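Because the data is freely available and queryable, a per-article revision
history can be pulled directly from the public MediaWiki API, as in the sketch
below. The query parameters follow the documented revisions module; the
User-Agent string and the 50-revision limit are placeholders, and paging
through the full history is omitted.

import json
import urllib.parse
import urllib.request

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Barack Obama",
    "rvprop": "ids|timestamp|user|comment",
    "rvlimit": "50",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
request = urllib.request.Request(url, headers={"User-Agent": "provenance-example/0.1"})

with urllib.request.urlopen(request) as response:
    data = json.load(response)

for page in data["query"]["pages"].values():
    for rev in page.get("revisions", []):
        print(rev["revid"], rev["user"], rev["timestamp"], rev.get("comment", ""))

Output like this (who, when, what, and why for each revision) is the raw
material that a translation into OPM would organize into artifacts, processes,
and agents.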
--
PeterMacko (on behalf of Daniel Margo) - 14 May 2010