Wikipedia Revision History
Name: Wikipedia Revision History
Authors: Harvard University/The Wikimedia Foundation
Brief Summary: Wikipedia retains the complete history of its articles'
revisions in order to assist collaborators and moderators. This data is used
to answer questions such as: who made a revision? What was changed? When and
where was the change made, and why? When viewed at the granularity of individual
pages (Wikipedia's native format) this data is linear, with each instance of
the page descending from its singular predecessor; but at coarser or finer
granularities the network is much more complex. At coarser granularities, there
are many agents (users) that operate in different ways on different pages to
produce Wikipedia as a whole. At finer granularities, these agents operate on
paragraphs, sentences, and words in an iterative, collaborative manner that
makes answering a question such as "Where did this sentence come from?"
interesting and challenging.
It seems perfectly reasonable to wish to incorporate Wikipedia revision
histories into a provenance system, but doing so requires translating the
existing data into a format such as the Open Provenance Model (OPM) and making
many representational decisions along the way. It is important to evaluate how
easy it is for different parties to incorporate non-OPM data into an OPM
framework, and whether the results of such parallel ingests can interoperate.
Scenario Diagram: Wikipedia's revision history can be represented in many ways.
We've produced a number of graphs in OPM format at a finer granularity than
Wikipedia's standard representation. In our graphs, nodes are revisions, and
ancestry is defined as follows: a revision depends on the body of text that it
modifies. Attached, you'll find a graph of Wikipedia's "Barack Obama" article,
revisions #279136198 through #294850556. Note its principled structure. Of
course, this is only one possible translation of Wikipedia into OPM.
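As a rough illustration of this kind of translation (a sketch, not the tooling
behind the attached graphs), the following Python fragment builds a graph in
which each node is a revision ID and each edge records that a revision was
derived from the revision it modified. The class name ProvenanceGraph and the
intermediate revision ID are made up for the example.

# Minimal sketch: each node is a revision, each edge records that a revision
# was derived from the revision whose text it modified.
from collections import defaultdict

class ProvenanceGraph:
    def __init__(self):
        self.nodes = set()                    # revision IDs (OPM artifacts)
        self.derived_from = defaultdict(set)  # child revision -> parent revisions

    def add_revision(self, rev_id, parent_id=None):
        """Register a revision; the edge to parent_id models "was derived from"."""
        self.nodes.add(rev_id)
        if parent_id is not None:
            self.nodes.add(parent_id)
            self.derived_from[rev_id].add(parent_id)

    def ancestors(self, rev_id):
        """All revisions that rev_id transitively depends on."""
        seen, stack = set(), [rev_id]
        while stack:
            for parent in self.derived_from[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

graph = ProvenanceGraph()
graph.add_revision(279136198)                       # first revision in the range above
graph.add_revision(279150000, parent_id=279136198)  # hypothetical intermediate revision
graph.add_revision(294850556, parent_id=279150000)  # last revision in the range above
print(graph.ancestors(294850556))                   # -> {279136198, 279150000}

In a full translation, each revision node would also be connected to the
editing agent (the user) and to the edit process itself, but the sketch above
captures only the artifact-to-artifact ancestry described here.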
Users: Editors, moderators, and researchers of Wikipedia.
Requirements for Provenance: Active editors of an article need to understand
what changes are being made to that article, and why, in order to contribute.
Moderators need to know what and why, as well as who and when, in order to
attribute revisions and reward and punish editors. Researchers have diverse
needs; for example, social network researchers need to understand a
coarse-granularity view of Wikipedia that focuses on its users and the ways in
which they interact.
Provenance Questions: Some examples, in order of increasing complexity (a
sketch of how the simpler ones map onto revision metadata follows this list):
- "How has this article been edited recently?"
- "Who produced this vandalism?"
- "Where did this sentence come from?"
- "Who are the authors of this paragraph, and in what capacities?"
- "What sorts of revisions does editor X make? Which ones lead to collaborative activity, and which ones are reverted? Is editor X a habitual vandal?"
- "Which editors revise each others' work? How have these editors affected each other? What is the social network of editors as defined by their history of interactions?"
Technologies Used: The fundamental data of course comes from Wikipedia: freely
available, queryable, and large. Many users work with this data in its native
format, and in addition many research groups independently ingest it into their
own systems for their own purposes.
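Because the data is freely available and queryable, a per-article revision
history can be pulled directly from the public MediaWiki API, as in the sketch
below. The query parameters follow the documented revisions module; the
User-Agent string and the 50-revision limit are placeholders, and paging
through the full history is omitted.

import json
import urllib.parse
import urllib.request

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Barack Obama",
    "rvprop": "ids|timestamp|user|comment",
    "rvlimit": "50",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
request = urllib.request.Request(url, headers={"User-Agent": "provenance-example/0.1"})

with urllib.request.urlopen(request) as response:
    data = json.load(response)

for page in data["query"]["pages"].values():
    for rev in page.get("revisions", []):
        print(rev["revid"], rev["user"], rev["timestamp"], rev.get("comment", ""))

Output like this (who, when, what, and why for each revision) is the raw
material that a translation into OPM would organize into artifacts, processes,
and agents.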
--
PeterMacko (on behalf of Daniel Margo) - 14 May 2010