Friday, September 5, 2008

SezHoo and WikiSym 2008

I'm proud to announce the release of SezHoo! It's a tool we created at MITRE to help establish reputations for authors in wiki sites... specifically those run on MediaWiki (the same software that runs Wikipedia).

The way that SezHoo works is that it goes through a wiki article's history and tracks the authorship of each word in the article. We use a pretty popular technique in text analysis called shingling to help us out.

For each revision of an article, we break it up into shingles. We're using shingles that are 6 words long, so each word will have 6 shingles associated with it. We credit authorship of the word to the author of the earliest shingle.

This allows us to add some information to a MediaWiki installation:

We show the authors of a page, and the percentages at which they have contributed. We also offer this information at the section level.

The information we compute is better than doing diffs with a standard MediaWiki install. Diffs are line ending based, so changing a word in a paragraph that has a single line ending can make it difficult to determine who wrote the paragraph. In addition, since we are looking for authorship of a word across all revisions at once (as opposed to a revision vs. revision comparison with a diff), this technique is immune to copying and pasting of text and reversion of vandalism.

We create a reputation for authors with 3 dimensions: quantity, quality and value. For all metrics, we assign a 5 star rating. The way this is done is by computing a raw score for all authors and then ranking them. Authors in the top 20% have a 5-star rating in that dimension.

For quantity, we simply calculate the number of word contributed by each author. This includes words that may no longer be in the current revision of the article. Value takes the percentage authorship of a page, and multiplies that by the number of page views (it's valuable if it's being read).

Quality is a lot more tricky. We assume that wiki articles are a Darwinian environment. That is, quality text will "survive" in that as the article is edited, other authors will leave the text alone. Poor quality text will be "killed" or deleted by other authors. We determine quality by looking at the percentage of text that is expected to be alive for an author after 8 revisions (you can change 8 to whatever you like, but it works well for our internal wiki). To calculate the percentage alive after 8 revisions, we used a Kaplan-Meier estimate. This was necessary due to the fact that authors will have phrases which are "alive" (they are in the current version of an article). So if a word has survived 3 edits and is in the current version of an article, when calculating survival probabilities you don't want to say that it "died" at 3 revisions, but you don't want to lose the data either. Kaplan-Meier handles that for us.

In the end, you get a 5-star rating for each author in the dimensions of quantity, quality and value.

I'll be presenting this work at WikiSym 2008. We're doing some other interesting stuff which we haven't rolled into SezHoo yet, so it's worth coming to my talk even if you've read this blog post.